Searching the web involves figuring out the meaning of queries and the content of web pages. The problem is how to get a machine to capture the content (meaning) of a web page.
There is an area of artificial intelligence called natural language understanding that attempts to develop programs that can understand human texts. How might you determine whether a program understands the content of a web page? What about getting the program to generate summaries or answer questions?
In class, we're going to look at some web pages and see if you can figure out what they're about. All of the Scheme and Perl code used in this lecture and referred to in the book can be downloaded as a zip archive.
We've selected four web pages to use as examples and we'll be passing out different versions and summaries of these pages in class. First we submit the relevant URLs to a browser and then simply print them out. Here's an example of what a browser starts with:
% set url = "www.cs.brown.edu/people/tld/file.html"
% curl $url
<HTML>
<HEAD>
<TITLE>Simple HTML File</TITLE>
</HEAD>
<BODY>
<H1>Section Heading</H1>
<P>
Here is a file containing HTML formatting commands that has a
body consisting of one section heading and a single paragraph.
</P>
</BODY>
</HTML>
Of course, if there are images, tables, banners, background colors, etc., then these will be referred to in the HTML and downloaded and displayed by the browser as appropriate.
Some of you will be given printouts generated by standard graphical browsers for the four example pages. Look at the page, read the text, and write a one-sentence summary on the reverse side of the printout.
Next we use the parser in a text-based browser called lynx to strip off all of the HTML formatting commands and display only the raw text:
% lynx -dump -nolist $url

Section Heading

Here is a file containing HTML formatting commands that has a
body consisting of one section heading and a single paragraph.
Another group of you will be given these plain-text printouts. Here again, look at the printout, read the text, and write a one-sentence summary on the reverse side.
Next we're going to get rid of everything but the words and the number of times each appears in the web page. The Perl program wf.pl computes the frequency of the terms in a text file and then sorts the terms by frequency. In the following code, the text to the right of the hash (#) signs consists of comments:
% cat wf.pl
#!/usr/bin/perl
while ($_ = <ARGV>) {                  # for each line in the file,
    tr/A-Z/a-z/;                       # convert the string to lowercase
    @words = split(/\W+/, $_);         # split the line into a list of words
    foreach $word (@words) {           # for each nonempty word, increment
        next if $word eq "";           # the number of times you've seen
        $wordcount{$word}++;           # it in the file
    }
}
sub bycount {                          # compare two words by their frequency
    $wordcount{$a} <=> $wordcount{$b};
}
foreach $word (sort bycount keys %wordcount) {     # sort the words
    printf "%10s %d\n", $word, $wordcount{$word};  # and print them
}
Let's look at the frequency of the terms appearing on my web page.
% set url = "www.cs.brown.edu/people/tld/home.html"
Just looking at the most frequently appearing words we get:
% lynx -dump -nolist $url | perl wf.pl | tail
      from 7
       for 7
        is 8
         i 10
         a 14
        in 16
        of 17
       the 20
       and 25
And looking at the least frequently appearing words we get:
% lynx -dump -nolist $url | perl wf.pl | head
satisfying 1
        up 1
        cs 1
      risk 1
       tld 1
       inc 1
     range 1
      many 1
     ijcai 1
     years 1
Most methods for summarizing text in terms of word frequencies also use some sort of stemming. Stemming reduces different words derived from the same root to that common root. For example, consider the roots of the following words:
% cat words.txt
word words wording
gain gains gaining
sink sinks sinking sinker
java javas
Here's how a popular stemming algorithm, the Porter stemmer, implemented in Perl as porter.pl, handles these words:
% cat words.txt | perl porter.pl
word word word
gain gain gain
sink sink sink sinker
java java
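You don't need Porter's full rule set to get the flavor of suffix stripping. Here is a deliberately crude sketch; the file name crude.pl is made up, and this is not the algorithm porter.pl implements, just a caricature of the rule-based idea:

% cat crude.pl
#!/usr/bin/perl
# a deliberately crude suffix stripper -- NOT the Porter algorithm
while ($_ = <ARGV>) {                       # for each line of input,
    foreach $word (split(' ', $_)) {        # and for each word on the line,
        $word =~ s/ing$//;                  # strip a trailing "ing"
        $word =~ s/ed$//;                   # strip a trailing "ed"
        $word =~ s/s$// unless $word =~ /ss$/;  # strip a plural "s" (but not "ss")
        print "$word ";                     # print the crudely stemmed word
    }
    print "\n";
}

Running cat words.txt | perl crude.pl would get the easy cases right, but unlike porter.pl the sketch has no repair rules: "scheming" comes out as "schem" rather than "scheme".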
Notice how stemming changes the list of the highest frequency terms on my home page:
% lynx -dump -nolist $url | perl porter.pl | perl wf.pl | tail
        is 8
  intellig 8
    comput 9
         i 10
         a 14
        in 16
        of 17
       the 20
       and 25
Most of the time stemming makes sense, but sometimes the process conflates words that really are different. Notice, in particular:
% echo "scheme schemes scheming" | perl porter.pl
scheme scheme scheme
% echo "university universe universal" | perl porter.pl
univers univers univers
Clearly the method of stemming used by porter.pl doesn't understand the meaning of the words it's working on. Can you think of a more intelligent method for stemming words?
A final group of you will be given the term-frequency summaries for the four web pages; scan the frequency information and then write a one-sentence summary.
In class we'll compare the summaries based on the three different types of information.
Most search engines:
Determine an appropriate dictionary of terms for summary purposes.
Use a standard format for summaries called a term-frequency (TF) vector.
Compare web pages using some measure (e.g., cosine angle) on TF vectors; a sketch of this comparison appears after this list.
Answer queries in part by comparing with TF query vectors.
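Here is a rough sketch of the comparison step. It assumes two input files in the two-column "word count" format that wf.pl prints; the file name cosine.pl and the helper readvector are made up for this example:

% cat cosine.pl
#!/usr/bin/perl
# compare two term-frequency vectors by the cosine of the angle between them;
# each input file is in the two-column "word count" format that wf.pl prints

sub readvector {                      # read "word count" lines into a hash
    my ($file) = @_;
    my %v;
    open(F, $file) or die "can't open $file\n";
    while (<F>) {
        my ($word, $count) = split;   # first field is the term, second its count
        $v{$word} = $count;
    }
    close(F);
    return %v;
}

%a = readvector($ARGV[0]);            # the two TF vectors to compare
%b = readvector($ARGV[1]);

$dot = $norma = $normb = 0;
foreach $word (keys %a) {             # dot product over the shared terms
    $dot += $a{$word} * $b{$word} if exists $b{$word};
}
$norma += $_ * $_ foreach values %a;  # squared length of each vector
$normb += $_ * $_ foreach values %b;

printf "cosine = %.3f\n", $dot / (sqrt($norma) * sqrt($normb));

You might produce the two inputs with something like lynx -dump -nolist $url | perl wf.pl > one.txt for each page and then run perl cosine.pl one.txt two.txt. Pages that share their heavy terms score near 1, and pages with little vocabulary in common score near 0, which is also why the improvements suggested next matter: without them, words like "the" and "and" dominate the comparison.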
Think about how we might improve TF vectors:
Restricted dictionaries, e.g., eliminate commonly appearing words; a sketch of such a filter appears after this list.
Better stemming, e.g., make use of word meaning dictionaries.
Weight the vector components to reflect the entire corpus of documents.
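As a small illustration of the first suggestion, here is a sketch of a stop-word filter; the file name stop.pl and the particular stop list are made up, and a real restricted dictionary would be much larger:

% cat stop.pl
#!/usr/bin/perl
# drop commonly appearing words before counting; the stop list is a tiny sample
%stop = map { $_ => 1 } qw(a an and for from i in is of the to);
while ($_ = <ARGV>) {                    # for each line in the file,
    tr/A-Z/a-z/;                         # convert the string to lowercase
    foreach $word (split(/\W+/, $_)) {   # for each word on the line, keep it
        print "$word\n" if $word ne "" and not $stop{$word};  # only if it isn't
    }                                    # on the stop list
}

Inserting this into the pipeline, as in lynx -dump -nolist $url | perl stop.pl | perl wf.pl | tail, should keep words like "the" and "and" from crowding out the content words at the bottom of the frequency list.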
Here are the standard methods for improving on raw term frequencies. First we define a function that indicates whether a term occurs in a document:
occurs(d,t) = 1 if term t occurs in document d
            = 0 otherwise
Here's a form of simple "normalized" term frequency:
TF(d,t) = occurs(d,t) / sum_t' occurs(d,t')

where the sum ranges over all the terms t' appearing in document d.
Here's a method called inverse document frequency term weighting:
IDF(t) = log((1 + number-of-docs) / sum_d occurs(d,t))

where the sum ranges over all the documents d in the collection.
We can combine term frequency and inverse document frequency in a single measure:
TFIDF(d,t) = TF(d,t) * IDF(t)
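To make the definitions concrete, here is a sketch that computes all three quantities for a tiny made-up corpus; the three "documents" and the file name tfidf.pl are invented purely to keep the arithmetic easy to check:

% cat tfidf.pl
#!/usr/bin/perl
# compute TF, IDF and TFIDF, as defined above, for a tiny made-up corpus

@docs = ("the beer and the diapers",
         "the beer display",
         "java and scheme");

$ndocs = scalar(@docs);

foreach $d (0 .. $#docs) {                    # occurs(d,t) = 1 if t occurs in d
    foreach $t (split(/\W+/, $docs[$d])) {
        $occurs[$d]{$t} = 1;
    }
}

foreach $d (0 .. $#docs) {                    # df(t) = sum_d occurs(d,t), the
    foreach $t (keys %{$occurs[$d]}) {        # number of documents containing t
        $df{$t}++;
    }
}

foreach $d (0 .. $#docs) {
    $nterms = scalar(keys %{$occurs[$d]});    # number of distinct terms in d
    foreach $t (sort keys %{$occurs[$d]}) {
        $tf  = 1 / $nterms;                   # TF(d,t) = occurs(d,t) / sum_t' occurs(d,t')
        $idf = log((1 + $ndocs) / $df{$t});   # IDF(t) = log((1 + N) / df(t))
        printf "doc %d %10s TF %.3f IDF %.3f TFIDF %.3f\n",
               $d, $t, $tf, $idf, $tf * $idf;
    }
}

With binary occurrence counts, TF just spreads a document's weight evenly over its distinct terms, while IDF rewards rarity: "the", which shows up in two of the three documents, gets log(4/2), roughly 0.69, whereas "diapers", which shows up in only one, gets log(4/1), roughly 1.39.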
Think about when these methods are appropriate and when they might lead one astray.
The following excerpt was mined from the KD (Knowledge Discovery) Nuggets web page:
Bill Palace at UCLA (Spring 1996) in his web lecture notes writes: "For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays. On Thursdays, however, they only bought a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, they could move the beer display closer to the diaper display. And, they could make sure beer and diapers were sold at full price on Thursdays."
Think about how you might discover similar connections involving products and how you might exploit these connections to improve sales. Write prose describing an algorithm that implements your ideas.
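One very simple starting point is just bookkeeping: count how often each pair of products shows up in the same basket. The sketch below does exactly that; the file name pairs.pl and the input format, one comma-separated basket per line, are assumptions made for this example:

% cat pairs.pl
#!/usr/bin/perl
# count how often each pair of items appears in the same basket;
# input: one basket per line, items separated by commas,
# e.g.  diapers, beer, chips

while ($_ = <ARGV>) {                     # for each basket,
    chomp;
    tr/A-Z/a-z/;                          # normalize case
    s/^\s+//;                             # trim leading whitespace
    @items = sort grep { /\S/ } split(/\s*,\s*/, $_);    # list its items
    foreach $i (0 .. $#items) {           # for every pair of items in the
        foreach $j ($i + 1 .. $#items) {  # basket, increment that pair's
            $paircount{"$items[$i] + $items[$j]"}++;      # counter
        }
    }
}

sub bycount { $paircount{$b} <=> $paircount{$a} }    # most frequent pairs first
foreach $pair (sort bycount keys %paircount) {
    printf "%6d  %s\n", $paircount{$pair}, $pair;
}

Raw pair counts aren't quite enough on their own: frequently purchased items co-occur with everything. In your write-up you might also consider comparing each pair's count with what the individual item counts would lead you to expect, much as IDF discounts terms that appear in every document.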