You've Got (Junk) Email


A few days ago I was grousing about how the Artemis email list was receiving a glut of unsolicited, unwanted and, in several cases, downright offensive email. I've become largely inured to filtering out the junk email, or spam as it's called, from my daily incoming email. It takes me a few seconds each time I update my email INBOX to scan the new messages and delete the junk. If you read your email as often as I did while I was chair of the department, you hardly notice it. However, if you're away from your email for a few days, it's not unusual to return to hundreds of email messages, most of which are spam, and it is annoying to have to filter through the deluge, separating the messages you want to read from the advertisements, solicitations and other unwanted and often virus-ridden junk email.

The title for this journal entry is a variation on "You've Got Mail", the title of a film starring Tom Hanks and Meg Ryan and directed by Nora Ephron about two people who start a relationship by exchanging email. Nowadays the sweet solicitations of an aspiring cyber suitor are as likely as not to get mistaken for yet another unwanted advertisement from an on-line pornography peddler. By the way, I haven't seen the film and I have no recommendations to offer one way or the other. I know of the film's title simply because it received a lot of press for its treatment of a growing cultural phenomenon encompassing cyber dating, internet matchmaking and the explosion in electronically-mediated interpersonal relationships and services offering to facilitate such relationships.

In any case, I've been fretting about whether to write any more entries about artificial intelligence. As I mentioned in the July 11, 2002 entry, a professional lepidopterist might not be the best guide for a light-hearted romp through the fields searching for butterflies. Many of the topics studied in artificial intelligence are esoteric and, I gather from experience in bringing them up in conversation, somewhat dry and academic. Learning or rather machine learning is, however, a good topic (the very idea!) for party conversations as long as you don't delve into the theory or get too wrapped up in the philosophical implications. Machine learning is about getting machines to learn stuff - acquire skills, adapt to change, generalize from observations and, more generally, improve their performance based on experience. Effective machine learning techniques already exist and add value in a wide range of applications from detecting credit-card fraud to making your car run better. But none of the applications I had come up with struck me as terribly good fodder for a journal entry until not one but two of my web-wise peripatetic1 colleagues pointed me to Paul Graham's Plan for Spam web page.

Filtering spam is the killer application for a new programming language, a dialect of Lisp called Arc, that Graham is working on. The ideal spam filter is a program that is seamlessly integrated into an email application (called an email client alluding to its relationship with an email server) and that quickly and unobtrusively learns your preferences and eliminates all and only unwanted email messages before you even see them. Small startups are developing filters that can be added to existing clients and companies that develop and market email clients are adding value to their products by incorporating filters into their designs. I'm intrigued by Graham's design but today I'm primarily interested in the problem and its connection to machine learning.

You learn to filter spam by looking at a large number of email messages. It's the same type of problem as the one we considered in the August 8, 2002 journal entry; then we were interested in distinguishing Arctic char from Atlantic salmon and now we want to distinguish spam from non-spam. In the case of sorting fish, it is usually necessary to have someone provide a training set by pointing out examples of each type of fish. In the case of filtering spam, you and I can provide the training examples. Indeed, since each user's notion of spam is different (you may want to be alerted every time a new album by Destiny's Child is released but I'd just as soon not be bothered), it's important that a spam filter somehow learn a user's preferences. The challenge is learning these preferences without requiring the user to do too much work since the whole idea is to save you time and effort.

One of the most studied areas of machine learning is concerned with supervised learning, learning in which the learning program is given a set of clearly marked training examples. In a supervised learning version of spam filtering, the user might mark a large number of messages as either spam or non-spam and the learning program would use these examples to learn to make distinctions on its own. A spam filter relying on supervised learning might require a fair amount of effort, e.g., marking several thousand messages, to achieve acceptable performance and a prospective user of such a program may not want to take the trouble. An alternative to making a single user supply all of the training examples is to collect examples from many users whose preferences are similar and use these examples to train a spam filter. This is an example of collaborative filtering which identifies patterns in user preferences in order to guide users in finding goods and services (and sellers in marketing those goods and services to potential buyers). You see collaborative filtering at work when you visit an on-line seller and, in response to a query about a particular product, you see something like, "Customers who bought this book [or CD or DVD or video game] also bought:" followed by a list of similar products.

There are all sorts of approaches to filtering spam but most of them require a solution to a supervised learning problem. Today I just want to focus on learning algorithms that take as input a set of training examples, messages marked spam or non-spam in our case, and provide as output a spam classifier which, when applied to a new message, returns a classification of spam or non-spam. I also want to say a little about what besides machine learning is involved in building a spam filter and, incidentally, why Paul Graham considers the problem as motivation for yet another programming language. Like several of the other topics in this journal, this one gives me an excuse to do some hacking.

Most email clients keep your email messages in a file which is typically called an INBOX. There are all sorts of email protocols, e.g., POP and IMAP, and all sorts of clients and servers that implement these protocols but sooner or later a spam filter is going to have to grub around in an INBOX file full of messages in some format or possibly in multiple formats. Message formats vary in how they deal with headers, e.g., the contents of the "To:", "From:" and "Subject:" fields that you see at the top of every message, message content, e.g., allowing embedded HTML or other text formatting, and attachments, e.g., how binary files corresponding to images and documents are encoded, compressed and attached to the message. When it comes to making sense of the content of electronic files, computer scientists have a host of powerful tools at their disposal.

The first thing you might want to do with an INBOX is to identify the pieces of the file that are relevant to your application, spam filtering in this case. A lexical analyzer or lexer scans an input stream such as that produced by reading a file one character at a time and applies a collection of rules to convert the stream into a sequence of meaningful tokens. Each rule specifies a pattern and an action. As the lexer scans the input stream, if it sees a group of characters that match the pattern for a given rule, then the lexer performs the action associated with the rule. For example, a set of lexer rules for making sense of an INBOX might include a rule that looks for the sequence of characters, "From:", and whenever it finds such a sequence converts it into a FROM token. Other tokens might include the following: BEGIN-MESSAGE, END-MESSAGE, ADDRESS, SUBJECT, URL, etc. In some cases, the tokens take the place of pieces of the input stream, e.g., the string "From:" is replaced by the token (FROM) as is the string "FROM:" (it's irrelevant whether the characters are upper or lower case). In other cases, the tokens are attached to actual pieces of the input stream, e.g., tld@cs.brown.edu is replaced by (ADDRESS "tld@cs.brown.edu"). A lexer typically strips away irrelevant information such as spaces, tabs and carriage returns and emphasizes the meaningful content. Consider the following example email message (MIME stands for Multipurpose Internet Mail Extensions and is a widely used standard for encoding binary files as email attachments).

From: "Sulee Jenerika" <ejls@cs.brown.edu>
To: "Tom Dean" <tld@cs.brown.edu>
Subject: Re: Artemis Brochure

Hi Tom, 

We've got a revised draft for the Artemis 2002 
Brochure. Take a look and tell us what you think.

Sulee

[MIME-ATTACHMENT ...]

The lexical analyzer turns the above message into the following sequence of tokens: (BEGIN-MESSAGE), (BEGIN-HEADER) (FROM), (WORD "Sulee"), (WORD "Jenerika"), (ADDRESS "ejls@cs.brown.edu"), (TO), (WORD "Tom"), (WORD "Dean"), (ADDRESS "tld@cs.brown.edu"), (SUBJECT), (WORD "RE:"), (WORD "Artemis"), (WORD "Brochure"), (END-HEADER), (BEGIN-BODY), (WORD "Hi"), (WORD "Tom"), ..., (BEGIN-MIME-ATTACH), ..., (END-MIME-ATTACH), (END-BODY), (END-MESSAGE).
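To make the rule-and-action idea concrete, here's a minimal lexer sketched in Python; the rule set, the token names and the handling of quotes and punctuation are illustrative assumptions, much reduced from what a real INBOX lexer would need.

```python
import re

# A hypothetical, much-reduced rule set: each rule pairs a pattern with
# an action that builds a token.  Order matters; the first match wins.
RULES = [
    (re.compile(r"From:", re.IGNORECASE),    lambda m: ("FROM",)),
    (re.compile(r"To:", re.IGNORECASE),      lambda m: ("TO",)),
    (re.compile(r"Subject:", re.IGNORECASE), lambda m: ("SUBJECT",)),
    (re.compile(r"<([\w.+-]+@[\w.-]+)>"),    lambda m: ("ADDRESS", m.group(1))),
    (re.compile(r"[\w.+-]+@[\w.-]+"),        lambda m: ("ADDRESS", m.group())),
    (re.compile(r"[\w'-]+"),                 lambda m: ("WORD", m.group())),
]

def tokenize(stream):
    """Scan the input left to right, emitting a token for the first rule
    that matches at the current position; anything unmatched (whitespace,
    quotes and other punctuation) is stripped away."""
    tokens, pos = [], 0
    while pos < len(stream):
        for pattern, action in RULES:
            m = pattern.match(stream, pos)
            if m:
                tokens.append(action(m))
                pos = m.end()
                break
        else:
            pos += 1   # no rule applies: drop the character
    return tokens

print(tokenize('From: "Sulee Jenerika" <ejls@cs.brown.edu>'))
```

A real lexer generator compiles all of the patterns into a single finite-state machine rather than trying each rule in turn, but the input-output behavior is the same.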

There's more structure in an INBOX file than is immediately evident in a sequence of tokens. For example, in the above example we see the following subsequence: (FROM), (WORD "Sulee"), (WORD "Jenerika"), (ADDRESS "ejls@cs.brown.edu") in the output from the lexical analyzer. This subsequence is analogous to a prepositional phrase indicating that the enclosing message is from (the preposition), someone identified by the supplied words (the proper noun serving as the object of the preposition) and described by the given address (an adjective of sorts further qualifying the object of the prepositional phrase). The structure in email messages and INBOX files can be described by a grammar consisting of grammatical rules not unlike those you may have learned in grade school, e.g., a prepositional phrase consists of a preposition and a noun phrase. However, unlike the rules of English grammar, which we're told by grammarians and style mavens alike serve only as guidelines, the grammar of messages and INBOX files is rather strictly adhered to and this makes it relatively easy to extract their content. A parser takes the output of a lexical analyzer and applies a set of grammatical rules in order to transform a sequence of tokens into a structured object capturing the content and structure of the original input stream. Continuing with our example, the parser turns the token sequence generated by the lexical analyzer above into the following structured object.

(MESSAGE 
 (SENDER (NAME "Sulee Jenerika")
         (ADDRESS "ejls@cs.brown.edu"))
 (RECIPIENT (NAME "Tom Dean")
            (ADDRESS "tld@cs.brown.edu"))
 (SUBJECT "Re: Artemis Brochure")
 (BODY (TEXT ("Hi" "Tom" ...))
       (MIME-ATTACHMENT (FORMAT PDF)
                        (COMPRESSION ZIP)
                        (DOCUMENT ...))
       ...))

It may not seem like we gained much but the above format makes it much easier for a program to extract the pieces of an email message relevant for spam filtering. I used a lexical analyzer called lex and a parser called yacc (which stands for "Yet Another Compiler Compiler") to convert my INBOX (actually several INBOX files) into a set of structured objects that I could then feed into various learning programs that I happen to have on hand written in Lisp and Mathematica (by the way, flex and bison are the analogous tools available from the GNU Project and the Free Software Foundation). I'd guess that Paul Graham's Arc language will have to provide some of the functionality of a lexical analyzer and a parser appropriately specialized to support programmers building spam filters, but that's just the start.

A point about artificial intelligence and its possible applications is probably in order before we get too much further along. AI technologies embedded in applications have been likened to raisins in raisin bread. Without the raisins you still have bread, it just ain't raisin bread. The machine learning algorithms in a spam filter are like the raisins in the raisin bread; they add value to the product but you still need all the dough around them to sell the product. A spam filter needs tools to analyze and parse INBOX files, a user interface to communicate with the user, and an applications programming interface so that the filter can coordinate with the email client. The amount of code dedicated to machine learning is likely to be modest. That said, it's still important. So, let's cut to the chase, forget about the user interface and API and think about what we're going to do with the messages now that we have the messages in a form we can play with.

Let's suppose that we've analyzed and parsed our files of email messages, interfaced with the user, interoperated with the email client, prestidigitated the whoozits and whatsits and generally done everything except separate the spam2 from the ham, the term we'll use to refer to human-acceptable mail. The bread is rising and we need to come up with some nice plump raisins. How do we get our spam filter to tell the difference between spam and ham? You might imagine a general-purpose learning algorithm that can take as input the raw text of an email message and is able to make a decision regarding the message's spamishness without our supplying any background knowledge or hints about what's important to attend to and what's not worth thinking about. While there are general-purpose learning algorithms, it often makes more sense (and results in better performance) if we can provide the learning algorithm with a good idea of what aspects or features of an email message are likely to be worth attending to. In the case of trying to distinguish between Arctic char and Atlantic salmon, an expert might tell a novice to look first at the color of the skin on the underbelly and then at the lower jaw of the fish. What features might be appropriate for a spam filter?

In distinguishing spam from ham, it's often useful to know whether the email came from someone at a company (a dot com) or a school (a dot edu). Other designators, e.g., organizations (a dot org) or the government (a dot gov) may also prove to be relevant. If the subject line indicates that someone is replying to an earlier message from you, e.g., the subject contains "Re:" or "RE:", then that might be a clue. If the subject line or body of the message includes any of the words, "buy", "sell", "winner", "prize", "sale", etc., then the message may be an advertisement or promotion. Messages with attached images or compressed files with the extension "EXE" are generally, but not always, unwanted. How a message is formatted may also provide clues. After you've thought about this for a while, you might think you could write a spam filter yourself. You could specify a bunch of (boolean) features such as, SENDER-DOT-COM, which is true (#t) if the address of the sender ends with the string "com" or "COM" and false (#f) otherwise. You could process each message to assign values to each of the features you've identified as relevant and then summarize the message as the set of these assignments. For example, our sample message above might look something like the following list in Scheme.

((SENDER-DOT-COM #f)
 (SENDER-DOT-EDU #t)
 (SUBJECT-IS-REPLY #t)
 (SUBJECT-CONTAINS-FREE-WORD #f)
 (BODY-CONTAINS-SALE-WORD #f)
 (BODY-CONTAINS-BUY-WORD #f)
 (BODY-FORMATTED-HTML #f)
 (ATTACHED-JPG-IMG #f)
 (ATTACHED-PDF-DOC #t)
 ...)
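Here's what that summarization step might look like sketched in Python; the dictionary representation of a parsed message and the particular feature tests are hypothetical stand-ins for whatever the parser actually produces.

```python
def extract_features(msg):
    """Map a parsed message (here just a plain dict; the keys and the
    feature set are illustrative assumptions) to boolean features."""
    sender = msg.get("sender_address", "").lower()
    subject = msg.get("subject", "").lower()
    body = msg.get("body", "").lower()
    attachments = [a.lower() for a in msg.get("attachments", [])]
    return {
        "SENDER-DOT-COM": sender.endswith(".com"),
        "SENDER-DOT-EDU": sender.endswith(".edu"),
        "SUBJECT-IS-REPLY": subject.startswith("re:"),
        "SUBJECT-CONTAINS-FREE-WORD": "free" in subject.split(),
        "BODY-CONTAINS-SALE-WORD": "sale" in body.split(),
        "BODY-CONTAINS-BUY-WORD": "buy" in body.split(),
        "BODY-FORMATTED-HTML": "<html>" in body,
        "ATTACHED-JPG-IMG": any(a.endswith(".jpg") for a in attachments),
        "ATTACHED-PDF-DOC": any(a.endswith(".pdf") for a in attachments),
    }

message = {"sender_address": "ejls@cs.brown.edu",
           "subject": "Re: Artemis Brochure",
           "body": "Hi Tom, We've got a revised draft ...",
           "attachments": ["brochure.pdf"]}
features = extract_features(message)
```

Each message is thereby reduced to a fixed-length vector of true/false values, which is exactly the form the learning algorithms below expect.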

Given this set of features, your hand-written spam filter might look something like the following.

if SUBJECT-IS-REPLY
   HAM
else if SENDER-DOT-COM
        if BODY-CONTAINS-SALE-WORD
           SPAM
        else if BODY-CONTAINS-BUY-WORD
                SPAM
             else if SUBJECT-CONTAINS-FREE-WORD 
                    SPAM
                  else HAM
     else if SENDER-DOT-EDU
             if ATTACHED-JPG-IMG
                SPAM
             else HAM
          else if BODY-FORMATTED-HTML
                  SPAM
               else HAM

So, why wouldn't this work besides the obvious problem that if the program got much larger and more complicated it would be impossible to understand? The main problem is that it probably isn't right and while some pieces of the code may make sense, it's really hard for a programmer to keep track of all the cases, especially if, as is likely to be the case, there are hundreds of possibly relevant features instead of the half dozen or so in our small example. What we'd like is to get a learning algorithm to create such a program by carefully looking at all the data, our email messages in this case.

Think about the set of all such programs using only if-then-else statements and a fixed set of boolean features as variables. Let's say there are n boolean features. How many such programs (called boolean functions) are there? The answer is lots. There are 2^n possible assignments to the n features and hence 2^n possible email messages that we can distinguish (note that some messages will be indistinguishable). A spam filter divides the set of messages into two classes, spam and ham, and hence there are as many boolean functions as there are ways of dividing 2^n messages into two classes, namely 2^(2^n), which is a very big number for any reasonably large n. For a given training set, the set of examples that we'll base our spam filter on, there will very likely be more than one boolean function that correctly classifies all of the examples, identifying spam as spam and ham as ham. For example, suppose that spam could be reliably identified by determining that the sender is a dot com and the body of the message contains HTML formatting and an attached JPG image. Here's a filter implementing this characterization of spam.
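The arithmetic is easy to check with a few lines of Python; the function names are just for illustration.

```python
def num_messages(n):
    """Number of distinct truth assignments to n boolean features."""
    return 2 ** n

def num_classifiers(n):
    """Number of boolean functions over n features: each of the 2^n
    assignments can independently be labeled spam or ham."""
    return 2 ** (2 ** n)

# The count of classifiers explodes even for a handful of features.
for n in (1, 2, 3, 4):
    print(n, num_messages(n), num_classifiers(n))
```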

if SENDER-DOT-COM
   if BODY-FORMATTED-HTML
      if ATTACHED-JPG-IMG
         SPAM
      else HAM
   else HAM
else HAM

Assuming that our characterization of spam is correct, you could use additional features in implementing a spam filter, such as whether or not there are PDF attachments, but they wouldn't help. Might they hurt? The answer is yes, and for a couple of reasons.

One reason is that the extra features might result in a large and slow-running spam filter. If you were to use all n features, you could construct a filter of size 2^n which would certainly be impractical for reasonably sized n. A more important reason from the standpoint of learning is that the extra features might mislead you, making it more difficult to generalize beyond the training examples. It's generalizing beyond what you've already seen that you're really after; you've already had to go through all of the training examples classifying them as spam or ham and you'll not likely see these exact message ever again. You want to classify future messages. Suppose that all of the messages in your training set that contained the word "free" in the subject line just happened to be messages that you wanted to read and so you wrote the following filter.

if SUBJECT-CONTAINS-FREE-WORD 
   HAM
else if SENDER-DOT-COM
        if BODY-FORMATTED-HTML
            if ATTACHED-JPG-IMG
               SPAM
            else HAM
        else HAM
     else HAM

With this filter, you'll misclassify spam messages that have "free" in the subject line. The first if-then-else statement in the above classifier is not only superfluous, it leads to misclassification. In practice, the best classifiers appear to be the simplest, most compact and reliant on the fewest features. This observation is in accord with Occam's razor, a heuristic or rule of thumb hearkening back to William of Ockham's principle of parsimony suggesting that simpler theories are better. There is no fundamental justification for believing that simpler theories are better in general and indeed there are theorems, called, aptly enough, no free lunch theorems, that suggest that there is no best (or universally optimal) method of choosing classifiers. That said, learning algorithms that employ some variant of Occam's razor to choose among classifiers tend to perform well in practice.

How might we design a learning algorithm to select a classifier in accord with Occam's razor? Here's a simple recursive algorithm that is the basis for one of the most widely used and practical methods for machine learning.

The input to our learning algorithm is a set of examples labeled spam or ham. The output will be a classifier consisting of nested if-then-else statements such as we presented above. Each if-then-else statement will use one of the n features as its condition or test and the then and else branches will either contain another (nested) if-then-else statement or a classification, spam or ham. The algorithm, which we'll call build, works recursively as follows. build takes two arguments: a set of examples that have yet to be classified and a set of features that have yet to be used. If all of the examples yet to be classified are spam, then build returns SPAM. If all of the examples are ham, it returns HAM. If some are spam and some are ham, build chooses a feature and constructs an if-then-else statement by inserting the selected feature as the test and computing the then and else branches by calling itself recursively on the remaining features and the examples split according to those in which the feature is true (the then branch) and those in which the feature is false (the else branch). Here's what build might look like implemented in Scheme. The following code uses the Lisp quasiquote notation (`) which is quite useful in writing programs that produce other programs as output. The quasiquote suppresses evaluation in the expression that immediately follows except where it encounters a comma (,) which forces the evaluation of the immediately following expression; so, for example, the expression `(one ,(+ 1 1) three) evaluates to (one 2 three).

(define (build examples features)
  (if (all-spam? examples)
      'SPAM
      (if (all-ham? examples)
          'HAM
          (let ((feature (choose-feature examples features)))
            `(if ,feature
                 ,(build (true-examples feature examples)
                         (remove feature features))
                 else ,(build (false-examples feature examples)
                              (remove feature features)))))))

Of course, this doesn't explain how build produces compact classifiers. The key part of the algorithm is in the choose-feature function which considers how each of the remaining features would split the examples into then and else branches and chooses the feature that results in the "best" split. I'll give you some examples to illustrate how one split might be considered better than another. Suppose choose-feature finds a feature such that the then branch would end up with all spam and the else branch with all ham. That's a great split; you can't get any better than that. Suppose that choose-feature finds a feature such that the then branch ends up with half spam and half ham and ditto for the else branch. That's a lousy split since it didn't make any useful distinctions among the remaining examples; you can't do any worse. In general, splits that result in then and else branches in which the remaining examples have more of one type of message than the other are good and those in which the numbers of spam and ham are approximately equal are bad. There are a number of functions that range from zero (equal proportions of spam and ham) to one (all spam or all ham) that serve about equally well for selecting a good split. As you might imagine, there are also debates on which functions work best for different learning problems. It's quite amazing, however, that such a simple algorithm is so effective in such a wide range of problems.
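Putting the pieces together, here's a runnable sketch of build in Python along with one plausible choose-feature. The entropy-based scoring function is just one member of the family of zero-to-one functions described above, and the representation of examples, the tie-breaking and the defaults for empty branches are all illustrative assumptions.

```python
import math

def purity(spam, ham):
    """1 minus the binary entropy of the spam fraction: 0 for an even
    mix, 1 when a group is all spam or all ham."""
    if spam == 0 or ham == 0:
        return 1.0
    p = spam / (spam + ham)
    return 1.0 + p * math.log2(p) + (1 - p) * math.log2(1 - p)

def score(examples, feature):
    """Purity of the then/else split induced by a feature, with each
    branch weighted by the fraction of examples it receives."""
    total = 0.0
    for wanted in (True, False):
        branch = [ex for ex in examples if ex["features"][feature] == wanted]
        if branch:
            spam = sum(1 for ex in branch if ex["label"] == "spam")
            total += len(branch) / len(examples) * purity(spam, len(branch) - spam)
    return total

def choose_feature(examples, features):
    return max(features, key=lambda f: score(examples, f))

def build(examples, features):
    """Return 'SPAM', 'HAM' or a nested ('if', feature, then, else) tree."""
    if not examples:
        return "HAM"    # arbitrary default for an empty branch
    labels = {ex["label"] for ex in examples}
    if labels == {"spam"}:
        return "SPAM"
    if labels == {"ham"}:
        return "HAM"
    if not features:    # mixed labels, nothing left to split on: majority vote
        spam = sum(1 for ex in examples if ex["label"] == "spam")
        return "SPAM" if spam > len(examples) - spam else "HAM"
    feature = choose_feature(examples, features)
    rest = [f for f in features if f != feature]
    then_ex = [ex for ex in examples if ex["features"][feature]]
    else_ex = [ex for ex in examples if not ex["features"][feature]]
    return ("if", feature, build(then_ex, rest), build(else_ex, rest))

# A tiny, made-up training set: spam exactly when COM and HTML both hold.
examples = [
    {"features": {"COM": True,  "HTML": True},  "label": "spam"},
    {"features": {"COM": True,  "HTML": False}, "label": "ham"},
    {"features": {"COM": False, "HTML": True},  "label": "ham"},
    {"features": {"COM": False, "HTML": False}, "label": "ham"},
]
print(build(examples, ["COM", "HTML"]))
```

The tree returned is a data structure rather than a generated program, but it plays the same role as the quasiquoted if-then-else expression the Scheme version builds.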

Of course you had to do some of the work. You had to select a set of possibly relevant features. The more features you select the more likely that build will be able to find a good set of features with which to implement a classifier but more features also require build to work harder. And it's always possible that you'll overlook important features. There's no easy solution to this problem; finding good features is hard and that's the reason that experts capable of making fine distinctions in various problem areas are in such demand.

It doesn't appear that Paul Graham is going to use a learning algorithm such as the one we sketched above for his spam filter. Graham plans to use a filter based on probability theory. He's not the first to suggest this and the basic idea is pretty straightforward. I'll say just a bit about the use of probability theory in machine learning since ideas from probability and statistics are often used directly in learning algorithms or are used to analyze or justify techniques which at first blush don't seem to have anything to do with probability or statistics.

I don't want to give you (and you probably don't want to listen to) a primer on probability theory so I'll be a little lax on the formalities in the following. Let's say that Pr(SPAM) is the fraction of all email messages that are spam; we'll call this the probability of spam. According to this definition, Pr(HAM) = 1 - Pr(SPAM) is the fraction of all email messages that are ham. Let's say that Pr(COM,HTML) is the fraction of all email messages such that the sender is a dot com and the body of the message contains HTML. In general, Pr(X1, X2,..., Xk) is the fraction of all email messages satisfying X1 through Xk. Finally, let's say that Pr(SPAM|HTML) is the fraction of email messages containing HTML which are also spam; this is referred to as the conditional probability that a message is spam given that it contains HTML. Stated in another way, Pr(SPAM|HTML) is the ratio of Pr(SPAM,HTML) to Pr(HTML) (divide the fraction of messages that are both spam and contain HTML by the fraction of messages that contain HTML). With these simple definitions we can do quite a lot.
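Estimating these fractions from a training set is just counting. Here's a sketch in Python; the training set is made up for illustration, with each message reduced to a pair of booleans.

```python
def estimate(messages, condition, given=None):
    """Estimate a probability as a ratio of counts: Pr(condition) over
    all messages, or Pr(condition | given) over those satisfying given."""
    pool = messages if given is None else [m for m in messages if given(m)]
    if not pool:
        return 0.0
    return sum(1 for m in pool if condition(m)) / len(pool)

# A made-up training set: each message is a pair (is_spam, contains_html).
training = [(True, True), (True, True), (True, False),
            (False, True), (False, False), (False, False)]

is_spam = lambda m: m[0]
has_html = lambda m: m[1]

pr_spam = estimate(training, is_spam)                        # Pr(SPAM)
pr_html_given_spam = estimate(training, has_html, is_spam)   # Pr(HTML|SPAM)
pr_spam_given_html = estimate(training, is_spam, has_html)   # Pr(SPAM|HTML)
```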

To make things simple, suppose that our set of features consists of HTML which is true of a message containing HTML, COM which is true if the sender of the message is from a dot com, and SALE which is true if the word "sale" appears in the message. We'll put an exclamation mark, !, in front of a feature to indicate that the feature is not present so that COM corresponds to (SENDER-DOT-COM #t) and !COM corresponds to (SENDER-DOT-COM #f). Given a message summarized by the following assignment to our features, (BODY-FORMATTED-HTML #t), (SENDER-DOT-COM #t) and (BODY-CONTAINS-SALE-WORD #f), we're interested in calculating Pr(SPAM|HTML,COM,!SALE), the probability that the message (characterized as containing HTML, being sent from someone at a dot com, and not containing the word "sale") is spam. The way that we'd use this in a spam filter is that, by trial and error we would establish a threshold, say 0.9, such that if Pr(SPAM|HTML,COM,!SALE) > 0.9, then we'll classify the message as spam and eliminate it from the user's INBOX.

Now we don't know Pr(SPAM|HTML,COM,!SALE) and so we'll have to calculate it using the probability calculus, a set of rules for manipulating probabilities. It turns out that while we don't have Pr(SPAM|HTML,COM,!SALE) it's pretty easy to calculate reasonable estimates of probabilities like Pr(HTML|SPAM), Pr(COM|SPAM) and Pr(!SALE|SPAM) from looking at a set of training examples, e.g., the estimate for Pr(HTML|SPAM) is just the ratio of the number of training examples that contain HTML and are spam to the number of examples that are spam. Using these probabilities and certain assumptions about the probability of features co-occurring, we can calculate Pr(SPAM|HTML,COM,!SALE) using Bayes rule, one particularly useful rule from the probability calculus3. Bayes rule and the implications that follow from its use are so important that an entire branch of probability and statistics, which Graham proposes to borrow from in building his spam filter, is called Bayesian probability theory.
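Here's a sketch in Python of that calculation. The assumption about co-occurring features mentioned above is the "naive Bayes" assumption that features occur independently given the class, and the conditional probabilities below are made-up numbers rather than estimates from real mail.

```python
def pr_spam_given_features(pr_spam, pr_feats_given_spam, pr_feats_given_ham):
    """Bayes rule under the naive assumption that features are
    independent given the class:
       Pr(SPAM|F) = Pr(F|SPAM) Pr(SPAM) / Pr(F),
    where Pr(F) = Pr(F|SPAM)Pr(SPAM) + Pr(F|HAM)Pr(HAM)."""
    like_spam, like_ham = pr_spam, 1.0 - pr_spam
    for ps, ph in zip(pr_feats_given_spam, pr_feats_given_ham):
        like_spam *= ps   # accumulate Pr(feature_i | SPAM)
        like_ham *= ph    # accumulate Pr(feature_i | HAM)
    return like_spam / (like_spam + like_ham)

# Made-up values for Pr(HTML|class), Pr(COM|class), Pr(!SALE|class),
# with a prior Pr(SPAM) of one half.
p = pr_spam_given_features(0.5, [0.8, 0.7, 0.4], [0.3, 0.2, 0.9])
print(p)
```

A filter would then compare p against the trial-and-error threshold (0.9 in the discussion above) to decide whether to drop the message.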

There are complications and problems to be overcome in using lessons from probability theory to design a good spam filter. I have no doubt, however, that it can be done and, if there aren't already, there soon will be filters on the market that tout the use of Bayesian probability theory. But there will also be spam filters using neural networks, decision trees, genetic algorithms and any of a host of other techniques and algorithms that are being studied in the field of artificial intelligence. The engineers and practitioners in the field keep pushing the envelope of what's possible for computers to do. Today computers can play chess, schedule commercial aircraft, identify credit-card fraud and diagnose problems in automobiles better and faster than humans. Tomorrow ... who knows what they'll be able to do better than us?

I'm often asked for suggestions about what books to read for a good introduction to artificial intelligence and machine learning in particular. Here I'll make a couple of recommendations starting with a self-serving but perfectly reasonable suggestion. A few years ago I co-authored an introductory text in artificial intelligence [Dean et al., 1995] designed for a single semester and covering both theory and practice. The text features a brief introduction to programming in a dialect of Lisp called Common Lisp and gives readers the opportunity to easily connect algorithms to working programs and theory to practice. All of the algorithms are specified both in pseudo code and in Common Lisp and a supplement available on the web provides all of the code described in the book plus examples. The supplement also provides support for translating the Common Lisp code to Scheme.

The text by Russell and Norvig which came out at about the same time as my book [Russell and Norvig, 1995] is much more comprehensive and detailed than mine and would require at least two semesters to do the material justice. The second edition which should be available by the end of 2002 is even more comprehensive and up to date. The supplements for the book include implementations of the algorithms in Common Lisp, Python (which is similar in some respects to Perl but was designed from the bottom up as an object-oriented programming language), Prolog, C++ and Java. This is an excellent text accessible to most college students but also containing advanced material of interest to graduate students.

Tom Mitchell's text entitled "Machine Learning" [Mitchell, 1997] provides a good introduction to basic machine learning theory and algorithms. The web site for the book features code supplements for learning decision trees (written in Common Lisp) and Bayesian learning to classify articles posted to news groups (written in C).

It sounds like the typical sort of thing that an academic would say but I can't emphasize enough the advantages of a thorough grounding in basic mathematics including graph theory, linear algebra and, especially if you're interested in machine learning, probability and statistics. For a solid two-semester introduction to probability and statistics, [DeGroot, 1986] is my favorite textbook. I still consult DeGroot's text and refer my students to it on a regular basis. There's now a third edition available [DeGroot and Schervish, 2002].


1. The word "peripatetic" derives from the Greek meaning to walk up and down and is associated with the teaching style of Aristotle and his followers who would pace back and forth while discoursing on philosophical issues. My use of "web-wise peripatetic" alludes to the behavior of many modern academics who hold forth by firing off emails including links to web pages of interest along with their commentary as they browse the web.

Main Entry: peri.pa.tet.ic
Function: noun
Date: 15th century
1 capitalized : a follower of Aristotle or adherent of Aristotelianism
2 : pedestrian, itinerant

Main Entry: peri.pa.tet.ic
Function: adjective
Etymology: Middle French & Latin; Middle French
peripatetique, from Latin peripateticus, from Greek peripatEtikos,
from peripatein to walk up and down, discourse while pacing (as did
Aristotle), from peri- + patein to tread; akin to Sanskrit patha path
Date: 1566
1 capitalized : Aristotelian
2 a : of, relating to, or given to walking b : moving or traveling
from place to place : itinerant

- Merriam Webster's Collegiate Dictionary

2. SPAM is a registered trademark of the Hormel Foods Corporation and, from what I've been told, it makes a pretty good sandwich if you're into eating meat. It is not, however, ham (pork that comes from the hind leg of a hog), for which SPAM is sometimes used as an inexpensive substitute.

3. Here is a quick overview describing how we can use Bayes rule to calculate Pr(SPAM|HTML,COM,!SALE). First, we derive Bayes rule using only the definition of conditional probability. In the following derivation, each line or step in the derivation is numbered with the justification and the previous lines used in the step. The last line (4) of the following derivation is the simplest formulation of Bayes rule.

1. Pr(A|B) = Pr(A,B) / Pr(B)        - Definition of Pr(A|B)

2. Pr(B|A) = Pr(A,B) / Pr(A)        - Definition of Pr(B|A)

3. Pr(A,B) = Pr(B|A) Pr(A)          - Algebraic manipulation using (2)

4. Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)  - Substitution using (1) and (3)
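To make the derivation concrete, here is a quick sketch in Python. The joint distribution is made up purely for illustration; any consistent assignment of probabilities would do. The point is that Bayes rule (step 4) gives the same answer as the definition of conditional probability (step 1).

```python
# Illustrative joint distribution over two binary events A and B.
# These numbers are hypothetical; any consistent distribution works.
p_a_and_b = 0.12   # Pr(A,B)
p_a = 0.30         # Pr(A)
p_b = 0.40         # Pr(B)

# Definitions of conditional probability (steps 1 and 2).
p_a_given_b = p_a_and_b / p_b   # Pr(A|B)
p_b_given_a = p_a_and_b / p_a   # Pr(B|A)

# Bayes rule (step 4): Pr(A|B) = Pr(B|A) Pr(A) / Pr(B).
bayes = p_b_given_a * p_a / p_b

print(p_a_given_b)
print(bayes)  # identical to p_a_given_b, as the derivation guarantees
```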

We're also going to rely on the following rule of probability, known as the law of total probability, Pr(B) = Pr(B|A) Pr(A) + Pr(B|!A) Pr(!A), which we justify informally as follows. Think concretely about the probability that a message contains HTML. Suppose we know the probability that a message contains HTML given that it's spam (the proportion of spam messages containing HTML) and the probability that a message contains HTML given that it's ham (the proportion of ham messages containing HTML). Since every message is either spam or ham, the overall proportion of messages containing HTML is obtained by weighting each of these proportions by the fraction of messages in that class and adding: the spam proportion times the fraction of messages that are spam, plus the ham proportion times the fraction of messages that are ham.

Finally, we're going to need a way of calculating Pr(HTML,COM,!SALE|SPAM). If we had enough training examples, we could do this directly, but if we have many more features, as we are likely to need in any really effective spam filter, then we will need far too many training examples to get accurate estimates for all of the probabilities of this form. Think about how many probabilities of this form you would actually need: for just three features we'd need Pr(HTML,COM,SALE|SPAM), Pr(!HTML,COM,SALE|SPAM), Pr(HTML,!COM,SALE|SPAM), and so on for all eight possible combinations. If you had only ten features (and that's not likely to be enough) and you needed even ten examples to estimate each probability (and that's not nearly enough), then already you'd need more than 10,000 training examples (1,024 combinations times 10 examples each), and there are likely to be some combinations for which you still wouldn't have enough examples.
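The arithmetic behind that explosion is easy to check: with n binary features there are 2**n distinct combinations of feature values, each needing its own conditional probability estimate.

```python
# With n binary features there are 2**n combinations of feature values,
# hence 2**n probabilities of the form Pr(f1,...,fn|SPAM) to estimate.
# At ten training examples per combination, the data requirement grows fast.
for n in (3, 10, 20):
    combos = 2 ** n
    examples_needed = combos * 10
    print(f"{n} features: {combos} combinations, >= {examples_needed} examples")
```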

As an expedient, we're going to assume that all of our features are mutually independent given the status of a message as spam or ham. For example, if true, this would mean that the presence of HTML in a document is completely independent of its coming from a dot com. This assumption of independence is not likely to be true for all pairs of features; for instance, if we see the word "sale" in the subject line of a spam message, the message is far more likely to be from a dot com than from a dot edu, so COM and SALE are not independent of one another given spam. Even so, such assumptions are often made on the grounds that they won't hurt too much and that, practically speaking, there aren't many alternatives. If we accept this assumption, then we can calculate Pr(HTML,COM,!SALE|SPAM) as the product of Pr(HTML|SPAM), Pr(COM|SPAM) and Pr(!SALE|SPAM), all probabilities for which estimates are relatively easy to obtain. So given the above derivations and assumptions, we can calculate the desired probabilities as follows.

Pr(SPAM|HTML,COM,!SALE) = Pr(HTML,COM,!SALE|SPAM) Pr(SPAM) / Pr(HTML,COM,!SALE)

Pr(HTML,COM,!SALE|SPAM) = Pr(HTML|SPAM) Pr(COM|SPAM) Pr(!SALE|SPAM)

Pr(HTML,COM,!SALE) = Pr(HTML,COM,!SALE|SPAM) Pr(SPAM) +
                     Pr(HTML,COM,!SALE|!SPAM) Pr(!SPAM)
                   = Pr(HTML|SPAM) Pr(COM|SPAM) Pr(!SALE|SPAM) Pr(SPAM) +
                     Pr(HTML|!SPAM) Pr(COM|!SPAM) Pr(!SALE|!SPAM) Pr(!SPAM)

Note that the assumption that all of the features are mutually independent given their designation as spam or ham, the assumption that allows us to compute Pr(HTML,COM|SPAM) as the product of Pr(HTML|SPAM) and Pr(COM|SPAM), is not justified. It constitutes a concession to computational considerations and as such it warrants great care in its application. There are other concessions implicit in the approach described above. In practice, it is likely that for some of the features you decide to use there will be very little data available to estimate the required probabilities. You can choose to throw such features out or you can use them with care. Quantifying the consequences of using such questionable features is the province of applied statistics.
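Putting the pieces together, here is a minimal sketch in Python of the whole calculation. All the probability estimates are hypothetical; in a real filter they would be obtained by counting feature occurrences in a corpus of messages already labeled spam or ham.

```python
# Hypothetical per-feature estimates; a real filter would count these
# from labeled training messages.
p_spam = 0.7                                            # Pr(SPAM)
p_given_spam = {"HTML": 0.9, "COM": 0.8, "SALE": 0.4}   # Pr(feature|SPAM)
p_given_ham  = {"HTML": 0.2, "COM": 0.5, "SALE": 0.05}  # Pr(feature|!SPAM)

def likelihood(table, present, absent):
    """Product of per-feature probabilities under the independence assumption."""
    result = 1.0
    for f in present:
        result *= table[f]          # Pr(feature|class)
    for f in absent:
        result *= 1.0 - table[f]    # Pr(!feature|class)
    return result

# A message containing HTML, from a dot com, with no "sale" in the subject.
present, absent = ["HTML", "COM"], ["SALE"]
like_spam = likelihood(p_given_spam, present, absent)  # Pr(HTML,COM,!SALE|SPAM)
like_ham = likelihood(p_given_ham, present, absent)    # Pr(HTML,COM,!SALE|!SPAM)

# Law of total probability (weighting each likelihood by the class prior),
# then Bayes rule for the posterior probability of spam.
p_evidence = like_spam * p_spam + like_ham * (1 - p_spam)
p_spam_given_evidence = like_spam * p_spam / p_evidence
print(round(p_spam_given_evidence, 3))
```

With these made-up numbers the message comes out overwhelmingly likely to be spam, which is what you'd hope for a dot-com message full of HTML.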