Forest for the Trees

Supplementary References

Try to come up with interesting examples of graphs and networks, especially large ones. Think about the graph-searching problems faced by companies like mapquest and yahoo that generate driving directions as part of their online services. Find out about the graph structure of the computer networks that your computer is connected to; for example, there may be a local network in your home, office or dorm, a somewhat more extensive network provided by your Internet service provider, company or university, and then there are the computers and connections that comprise those portions of the Internet in your geographical area.

Learn more about shell commands that allow you to maneuver in the graphs corresponding to hierarchical filesystems. Use man, info and the web to learn about the difference between built-in commands like cd and chdir and other commands like ls. You can use which to determine what type of command you're using; for example, type which cd to a shell. Many shell commands including ls, rm and find will search recursively through subdirectories depending on the options you give them. Learn more about these commands using man and info and try to figure out what search methods they employ.

Sample Lecture Material

Drawing Graphs

Drawing graphs or more precisely designing layouts for graphs so that they can be rendered on the printed page is an interesting and challenging area of computer science. It combines aspects of computational geometry which is concerned with algorithms involving geometric shapes and esthetics since what constitutes a useful layout often corresponds with what constitutes an esthetically pleasing layout. For an introduction to this field, you might visit Roberto Tamassia's web page on graph drawing which includes a tutorial, a bibliography of related work, and links to various graph drawing resources.

Graph drawing provides a means for scientists and engineers to visualize their data. Graph drawing is important in cartography (think about designing maps), genealogy (think of family trees), sociology (think of social networks), and various aspects of computer science from visualizing the dependencies among software modules to understanding the structure of hyperlinked web pages in the World Wide Web. It involves laying out the nodes and arcs on a page and placing labels such as the names of cities or streets so that they can be read easily.

There are a bunch of graph drawing programs available on the web; there are even web servers that allow you to submit a graph in a suitable form and will produce for you a rendered graph in the graphical format of your choice. We're going to look at a collection of graph drawing tools called GraphViz which is available as open source software from AT&T. If the GraphViz tools are not already installed on your computer, visit the AT&T web site and download a binary version appropriate for your operating system. We're going to look at two tools in the GraphViz collection of tools, dot and neato, which are really useful if you work a lot with graphs.

The dot program draws directed graphs and has its own language for describing graphs. Here's graph E from Chapter 13 described in the dot language and a demonstration of how you might call dot to produce a layout and then display it:

% cat e.dot
digraph E {
    1 -> 2 ;
    2 -> 3 ;
    2 -> 4 ;
    4 -> 5 ;
    4 -> 6 ;
    4 -> 1 ;
}
% dot -Tps e.dot -o e.eps
% epstopdf e.eps
% ls a.*
e.dot e.eps e.pdf
% acroread e.pdf

I told dot to render the graph in EPSF (Encapsulated Postscript Format) by using the -Tps option. dot can also produce graphical output using the GIF (Graphical Interchange Format) (using -Tgif) and JPEG (Joint Photographic Experts Group) (using -Tjpg) formats among others. The command epstopdf converts a document in EPSF to PDF (Portable Document Format) so that it can be displayed using the free PDF reader from Adobe. The epstopdf program may not be installed on your computer; if not, invoke dot with the either the -Tgif or the -Tjpg option and then use your browser to view the resulting GIF or JPEG file.

The dot graph language is fairly expressive allowing you to label nodes and arcs, display nodes in various shapes and create hierarchical nested graphs. Other programs implementing various graph algorithms have been designed to operate on the dot graph representation. For example, gc does for graphs what wc does for documents; it prints out the number of nodes, edges or connected components contained in a graph. You can learn more using man and info (try info dot). Here's another example showing off some of dot's capabilities:

% cat fancy.dot
digraph fancy {
    one [shape=box] ;
    two [shape=triangle] ;
    one -> { two ; three } ;
    five [shape=polygon,sides=5] ;
    three -> four -> five -> one ;
}
% dot -Tps fancy.dot -o fancy.eps
% epstopdf fancy.eps
% ls fancy.*
fancy.dot fancy.eps fancy.pdf
% acroread fancy.pdf

The neato command draws undirected graphs and has its own format. Here's graph A from Chapter 13:

% cat a.neato
graph A {
    2 -- 1 -- 5 ;
    4 -- 1 -- 3 ;
}
% neato -Tps a.neato -o a.eps
% epstopdf a.eps
% ls a.*
a.dot a.eps a.pdf
% acroread a.pdf

Think about the relative difficulty of laying out directed and undirected graphs. Which class of graphs do you think would be more difficult? Find out what a hierarchical graph is and think about why a hierarchical organization of nested subgraphs might make it easier to produce a comprehensible graph. We'll come back to drawing graphs in the next lecture when we have a somewhat more challenging graph to render.

Web Crawling

I'm going to assume that you understand the basic graph searching algorithms, breadth-first and depth-first, from reading the book. If you want to test your knowledge, check check out the Java algorithm animations for breadth-first and depth-first available from Duke University.

In this lecture, we're going to use another very useful shell program for crawling around on the World Wide Web. lynx is a text-based browser; it doesn't rely on any fancy graphics engines or window managers and it will work in an interactive mode within a simple terminal window or it can be invoked as a command in a shell script. We'll see lynx again in Chapter 14 on searching the web, but here we'll use it to construct the graph corresponding to small fragments of the web. We're going to be creating a bunch of files in our experiments and so we'll create a subdirectory in which to store them.

% mkdir web
% cd web

% set url = http://www.cs.brown.edu/~tld/talk/home.html

If you have a shell that emulates a reasonably powerful terminal, for example a vt100 (you can check this by typing echo $TERM, then you can use the following command.

% lynx -traversal -crawl -realm -number-links $url

This command starts lynx and then loads each file as it traverses the links; you'll see a bunch of files loaded and the web content flashing by as lynx examines each web page. If you find the web content flashing by annoying, you can divert the content into what amounts to a big bit bucket (the output is just ignored) by using the redirection operator as follows:

% lynx -traversal -crawl -realm -number-links $url > /dev/null

Or if you have dumb terminal, i.e., TERM = dumb, and can't run lynx in interactive mode, then you can fool lynx into thinking you have a vt100 terminal and then divert everything into the bit bucket anyway.²⁵

% lynx -term=vt100 -traversal -crawl -realm -number-links $url > /dev/null

The -traversal switch tells lynx to follow all links beginning at the specified start page. The -realm switch tells lynx to only following links in the same realm as the startpage, for example http://www.cs.brown.edu/~tld/talk/ tells lynx to restrict its attention to pages that have http://www.cs.brown.edu/~tld/talk/ as a base. The -crawl switch combined with -traversal causes lynx to output each unique hypertext page as an lnk#####.dat file in a format that we'll discuss presently.

With the switches set as specified in the above commands, lynx generates several files:

traverse.dat: a list of all the URLs that were traversed; each URL appears exactly once in this file.
traverse2.dat: a list of all URLs that were traversed in the order in which they were encountered or re-encountered during the traversal plus the titles of the corresponding web pages.
reject.dat: a list of URLs that have been rejected for one reason or another, including that a hyperlink is not within the indicated realm.
traverse.errors: this file is generated only if any errors are encountered during the traversal.
lnk#####.dat: a set of output files containing the contents of traversed hypertext documents in text format; with the -number_links switch, all hypertext links in a given document are numbered and added to the end of document's file.

Here are the files that were produced by the lynx command given above:

% ls
lnk00000000.dat lnk00000005.dat lnk00000010.dat lnk00000015.dat lnk00000020.dat
lnk00000001.dat lnk00000006.dat lnk00000011.dat lnk00000016.dat lnk00000021.dat
lnk00000002.dat lnk00000007.dat lnk00000012.dat lnk00000017.dat reject.dat
lnk00000003.dat lnk00000008.dat lnk00000013.dat lnk00000018.dat traverse.dat
lnk00000004.dat lnk00000009.dat lnk00000014.dat lnk00000019.dat traverse2.dat

Let's take a look at one of the lnk#####.dat files:

% cat lnk00000000.dat
THE_URL:http://www.cs.brown.edu/~tld/talk/home.html
THE_TITLE: Talking With Computers 

   [Go to first, previous, [1]next page;   [2]contents;   [3]index]

                        Talking With Computers

   This web site provides supplementary material for Talking With
   Computers written by Tom Dean and published by Cambridge University
   Press. You'll find a separate web page for each chapter providing
   easy access to code fragments from the book, tips on finding and
   installing software, links to online resources, sample exercises
   and lectures, and corrections to the published text. This material
   was produced in collaboration with Katrina Ligett and Damien
   Suttle. If you have questions or suggestions for improving these
   pages, please send email to [4]tld@cs.brown.edu.

   [Go to first, previous, [5]next page;   [6]contents;   [7]index]

   Last modified: Mon, July 28, 2003, 3:47 pm

References
   1. http://www.cs.brown.edu/~tld/talk/home-Z-H-1.html
   2. http://www.cs.brown.edu/~tld/talk/home-Z-H-1.html#node_toc_start
   3. http://www.cs.brown.edu/~tld/talk/home-Z-H-19.html#node_index_start
   4. mailto:tld@cs.brown.edu
   5. http://www.cs.brown.edu/~tld/talk/home-Z-H-1.html
   6. http://www.cs.brown.edu/~tld/talk/home-Z-H-1.html#node_toc_start
   7. http://www.cs.brown.edu/~tld/talk/home-Z-H-19.html#node_index_start

If you think about it a bit, you'll realize that the lnk#####.dat files taken together specify a graph in adjacency-list format: each node (specified in the first line) along with all the nodes to which it is linked (listed as hyperlinks under References). Let's see if we can convert this graph representation into a format that dot can use to render the graph in a suitable graphical format.

The following Perl program build.pl does the job. It's basic Perl but it's still pretty inscrutable for the uninitiated. Here's how it works: First the program goes through the traverse.dat and reject.dat files creating two hash tables traverse and reject that map URLs to unique integer identifiers. Then it goes through each of the lnk#####.dat files in turn converting each hyperlink to an arc in dot format, i.e., a string of the form i -> j ; where i is the integer id for the document's URL and j is the integer id for the URL associated with the hyperlink. The program also handles the rest of the syntax for the digraph format and adds dot commands so that rejected URLs are displayed as boxes while traversed URLs are displayed using the default ovals. Here is the sparsely commented program listing:

#!/usr/bin/perl

# Delete any old idx and dot files.
`/bin/rm -f *.idx *.dot` ;

# Initialize the URL id counter to zero.
$i = 0 ;

# Assign ids to all the URLs in traverse.dat.
open(TDAT, "traverse.dat" ) ;
# Create a file mapping traversed URLs to ids.
open(TIDX, "> traverse.idx" ) ;
while ( <TDAT> ) {
    if ( /(http:[\/\S.]*)/ ) {
        $traverse{$1} = ++$i ;
        printf TIDX "%s. %s\n", $i, $1 ;
    }
}
close(TIDX) ;
close(TDAT) ;

# Assign ids to all the URLs in reject.dat.
open(RDAT, "reject.dat" ) ;
# Create a file mapping rejected URLs to ids.
open(RIDX, "> reject.idx" ) ;
while ( <RDAT> ) {
    if ( /(http:[\/\S.]*)/ ) {
        $reject{$1} = ++$i ;
        printf RIDX "%s %s\n", $i, $1;
    }
}
close(RIDX) ;
close(RDAT) ;

# Create a new dot file for the web graph.
open(TDOT, "> traverse.dot" ) ;
# Begin writing the digraph in dot format.
printf TDOT "digraph traverse {\n" ;
# Loop through each file corresponding to a traversed document.
foreach $file ( <lnk*.dat> ) {
    # Use to keep track of repeated URLs.
    my %seen ; 
    # Open the data file for reading. 
    open(FILE, $file) ; 
    # Read in the first line of the file.
    $_ = <FILE> ;
    # This line should include a URL from traverse.dat.
    if ( /(http:[\/\S.]*)/ && $traverse{$1} ) {
        $start = $traverse{$1} ;
        while ( <FILE> ) {
            # Create exactly one arc per unique URL in the document.
            if ( /(http:[\/\S.]*)/ && !$seen{$1} ) {
                $seen{$1} = 1 ; 
                # Handle traversed and rejected URLs differently.
                if ( $traverse{$1} ) {
                    printf TDOT "%s -> %s ;\n", $start, $traverse{$1} ; 
                }
                # Rejected URLs are identified by a box shape.
                elsif ( $reject{$1} ) {
                    printf TDOT "%s [shape=box] ;\n", $reject{$1} ;
                    printf TDOT "%s -> %s ;\n", $start, $reject{$1} ; 
                }
            }
        }
    }
}
# Balance the parenthesis that opened the digraph block.
printf TDOT "}\n" ;
close(TDOT) ;

Now let's see how it works.

% perl build.pl
% dot -Tps traverse.dot -o traverse.eps
% epstopdf traverse.eps
% acroread traverse.pdf

In this case, dot did a pretty good job of laying out the graph to reveal the structure of this web site. The Perl code along with all the files produced by the lynx command, the graph produced by build.pl and the EPSF and PDF files produced by dot and epstopdf are available for download in an archive file or you can download just the PDF file showing the graph.

Vectors and Matrices

Linear algebra is a area of mathematics that figures prominently in applications from computer graphics to communication networks. Linear algebra deals with a variety of mathematical objects some of which are said to be indexed. In addition to numbers which are often referred to as scalars and are not indexed, there are vectors which have a single index, matrices which have two indices usually referred to as rows and columns given the obvious representation of a matrix as a table, and more general structures called tensors which include vectors and matrices as special cases and can have any number of indices.

There are as many interpretations of vectors and matrices as there are applications that make use of them. Vectors can be used to describe the positions of objects in a Euclidean space or the configuration of a robot with multiple degrees of freedom in some abstract ``joint'' space. In Chapter 14 we use vectors to summarize the content of web pages. And in Chapter 13 we use matrices to represent the links between nodes in graphs. In this lecture, we'll consider graphs whose nodes correspond to cities (e.g., Boston BOS, Providence PVD and New York NYC) and use a matrix to represent the costs (or distances) involved in using available rail service to get from one city to another. So we might have the following matrix assuming that there is no direct rail service between Boston and New York (and hence the distant is inf for infinite distance or cost):

     BOS PVD NYC
BOS   0   1  inf
PVD   1   0   2
NYC  inf  2   0

And here's a symbolic version of this matrix:

     BOS      PVD      NYC
BOS  BOStoBOS BOStoPVD BOStoNYC 
PVD  PVDtoBOS PVDtoPVD PVDtoNYC 
NYC  NYCtoBOS NYCtoPVD NYCtoNYC

We might also have a vector to represent either the initial cost of leaving from a city or the final cost of ending up in a given city:

BOS  BOSbase
PVD  PVDbase
NYC  NYCbase

We'll use Mathematica's facility with symbolic computation to illustrate some points and then we'll implement some of the basic linear algebra operations in Scheme. First, let's convert the travel matrix and city vector to Mathematica notation:

In[1]:= TableForm[ u = {BOSbase, PVDbase, NYCbase} ]
Out[1]//TableForm= BOSbase

                   PVDbase

                   NYCbase

In[2]:= TableForm[ M = {{BOStoBOS, BOStoPVD, BOStoNYC},
                        {PVDtoBOS, PVDtoPVD, PVDtoNYC}, 
                        {NYCtoBOS, NYCtoPVD, NYCtoNYC}} ]
Out[2]//TableForm= BOStoBOS   BOStoPVD   BOStoNYC

                   PVDtoBOS   PVDtoPVD   PVDtoNYC

                   NYCtoBOS   NYCtoPVD   NYCtoNYC

There are a number of useful operations involving scalars, vectors and matrices. For example you can take the dot product of two vectors of the same length. The dot product of two vectors of the same length is a scalar corresponding to the sum of the pairwise products of their respective components. See the dot product entry at Eric Weisstein's World of Mathematics for the standard geometric interpretation. You can implement the dot product of two vectors in Scheme using the apply and map operators as follows (the code implementing matrix and vector operations used in this lecture is available here):

(define (dot_vector_vector u v) 
  (apply + (map * u v)))

> (dot_vector_vector '(1 2 3) '(3 4 5))
26

You can also multiply a matrix with N rows and M columns by a vector with M components. The result is a vector of length N whose ith component is the dot product of the vector with the vector corresponding to the ith row of the matrix. This operation is considered a generalization of dot product and can be defined in Scheme as follows:

(define (dot_matrix_vector m v)
  (map (lambda (m_i) (dot_vector_vector m_i v)) m))

> (dot_matrix_vector '((1 2 3) (4 5 6)) '(7 8 9))
(50 122)

The transpose of an N by M matrix is an M by N matrix with the rows and columns interchanged:

> (transpose '((1 2 3) (4 5 6)))
((1 4) (2 5) (3 6))

Here's an implementation of the transpose function in Scheme:

(define (transpose m)
  (if (null? (car m))
      (list)
    (cons (map car m) 
          (transpose (map cdr m)))))

Using the transpose function we can define the dot product of a vector and a matrix which is not the same as the product of the matrix and the vector:

(define (dot_vector_matrix v m)
  (map (lambda (m_i) (dot_vector_vector v m_i)) (transpose m)))

> (dot_vector_matrix '(7 8 9) '((1 2) (3 4) (5 6)))
(76 100)

Finally you can multiply matrices; the dot product of a matrix with N rows and M columns with a matrix of M rows and P columns is a new matrix with N rows and P columns whose ijth component is the dot product of ith of row of the M by N matrix with the jth column of the N by P matrix. Work out an example to convince yourself that the following procedure implements this definition.

(define (dot_matrix_matrix a b)
  (transpose (map (lambda (b_i) (dot_matrix_vector a b_i)) (transpose b))))

Speaking of examples, we're long overdue for a couple. Think about the following three products and come up with an interpretation for each. Note that Mathematica does the right thing as defined above whether we're multiplying a vector times a matrix (u.M), a matrix time a vector (M.u), or a matrix times another matrix (M.M). How do you think Mathematica manages this and how would you implement a version of dot in Scheme that is similarly smart about handling its arguments?

In[4]:= TableForm[ u . M ]
Out[4]//TableForm= BOSbase BOStoBOS + NYCbase NYCtoBOS + PVDbase PVDtoBOS

                   BOSbase BOStoPVD + NYCbase NYCtoPVD + PVDbase PVDtoPVD

                   BOSbase BOStoNYC + NYCbase NYCtoNYC + PVDbase PVDtoNYC

In[5]:= TableForm[ M . u ]
Out[5]//TableForm= BOSbase BOStoBOS + BOStoNYC NYCbase + BOStoPVD PVDbase

                   BOSbase PVDtoBOS + NYCbase PVDtoNYC + PVDbase PVDtoPVD

                   BOSbase NYCtoBOS + NYCbase NYCtoNYC + NYCtoPVD PVDbase

In thinking up interpretations of these products, recall that in Mathematica the juxtaposition of two terms typically means multiplication, so that PVDbase PVDtoBOS should be thought of as PVDbase * PVDtoBOS. However, in trying to come up with interesting interpretations, you might want to think about alternatives to * and + to use as primitive operations for combining terms.

In[6]:= TableForm[ M . M ]
Out[6]//TableForm= 
            
    BOStoBOS BOStoBOS + BOStoNYC NYCtoBOS + BOStoPVD PVDtoBOS

    >  BOStoBOS BOStoPVD + BOStoNYC NYCtoPVD + BOStoPVD PVDtoPVD

    >  BOStoBOS BOStoNYC + BOStoNYC NYCtoNYC + BOStoPVD PVDtoNYC

    BOStoBOS PVDtoBOS + NYCtoBOS PVDtoNYC + PVDtoBOS PVDtoPVD
                                                         
    >  BOStoPVD PVDtoBOS + NYCtoPVD PVDtoNYC + PVDtoPVD PVDtoPVD

    >  BOStoNYC PVDtoBOS + NYCtoNYC PVDtoNYC + PVDtoNYC PVDtoPVD

    BOStoBOS NYCtoBOS + NYCtoBOS NYCtoNYC + NYCtoPVD PVDtoBOS

    >  BOStoPVD NYCtoBOS + NYCtoNYC NYCtoPVD + NYCtoPVD PVDtoPVD
                                     
    >  BOStoNYC NYCtoBOS + NYCtoNYC NYCtoNYC + NYCtoPVD PVDtoNYC

Mathematica has a generalization of the dot product function called Inner (in mathematics the term ``inner product'' is often used synonymously with ``dot product'') which allows the programmer to use different primitive functions instead of Times (*) and Plus (+). Here I substitute Plus for Times and Min for Plus. How would you interpret this operation?

In[7]:= TableForm[Inner[Plus, M, M, Min]]
Out[7]//TableForm= 
 
    Min[BOStoBOS + BOStoBOS, BOStoNYC + NYCtoBOS, BOStoPVD + PVDtoBOS]            
     
    >  Min[BOStoBOS + BOStoPVD, BOStoNYC + NYCtoPVD, BOStoPVD + PVDtoPVD]

    >  Min[BOStoBOS + BOStoNYC, BOStoNYC + NYCtoNYC, BOStoPVD + PVDtoNYC]

    Min[BOStoBOS + PVDtoBOS, NYCtoBOS + PVDtoNYC, PVDtoBOS + PVDtoPVD]
     
    >  Min[BOStoPVD + PVDtoBOS, NYCtoPVD + PVDtoNYC, PVDtoPVD + PVDtoPVD]
     
    >  Min[BOStoNYC + PVDtoBOS, NYCtoNYC + PVDtoNYC, PVDtoNYC + PVDtoPVD]

    Min[BOStoBOS + NYCtoBOS, NYCtoBOS + NYCtoNYC, NYCtoPVD + PVDtoBOS]   
     
    >  Min[BOStoPVD + NYCtoBOS, NYCtoNYC + NYCtoPVD, NYCtoPVD + PVDtoPVD]
     
    >  Min[BOStoNYC + NYCtoBOS, NYCtoNYC + NYCtoNYC, NYCtoPVD + PVDtoNYC]

Here's an alternative vector product in Scheme with + substituted for * and min substituted for +:

(define (alt_dot_vector_vector u v)
  (apply min (map + u v)))

And here are some alternatively defined matrix products using the alternative vector product function:

(define (alt_dot_matrix_vector m v)
  (map (lambda (m_i) (alt_dot_vector_vector m_i v)) m))

(define (alt_dot_matrix_matrix a b)
  (transpose (map (lambda (b_i) (alt_dot_matrix_vector a b_i)) (transpose b))))

Using DrScheme's infinite number notation (actually part of MzScheme), we can compute the shortest distance between any of our three cities as follows:

> (define travel '((    0     1  +inf.0 )
                   (    1     0    2    )
                   ( +inf.0   2    0    )))
> (alt_dot_matrix_matrix travel travel)
((0.0 1.0 3.0) (1.0 0 2.0) (3.0 2.0 0.0))

In general, you have to multiply the matrix by itself multiple times in order to get the shortest path in the case of more than three cities:

(define (shortest_path_all_pairs m)
  (aux_shortest_path_all_pairs m (length m)))

(define (aux_shortest_path_all_pairs m n)
  (if (= n 2) 
      m
    (aux_shortest_path_all_pairs (alt_dot_matrix_matrix m m)
                                 (- n 1))))

> (define travel '((   0       1   +inf.0     2   )
                   (   1       0      2    +inf.0 )
                   ( +inf.0    2      0       4   )
                   (   2    +inf.0    4       0   )))
> (shortest_path_all_pairs travel)
((0.0 1.0 3.0 2.0) (1.0 0.0 2.0 3.0) (3.0 2.0 0.0 4.0) (2.0 3.0 4.0 0.0))

Determine how many multiplications and additions are required to compute the shortest path between all pairs of n cities using this algorithm. Start by determining how many primitive operations are required to multiply two matrices. How many operations are required each time you call dot_vector_vector and how many times do you call this function in multiplying a pair of matrices? Could you perform this calculation using fewer primitive operations? Look up Dijkstra's algorithm for computing the shortest paths from a given starting city to all other cities and Floyd-Warshall's algorithm for computing the shortest paths for all pairs of cities. It's possible to compute the shortest path with some specified probability or a path that is close to the shortest path with some specified probability and do so a lot faster than Dijkstra's algorithm. Are there circumstances in which you'd sacrifice accuracy for speed?

Example Exercises

Use dot and neato to specify and then render a reasonably complicated graph characterizing the social relationships within a group of people. For example, construct a chat graph whose nodes consist of people and whose arcs correspond to reciprocal instant messaging relationships. A somewhat less interesting social graph might be the graph corresponding to the organizational chart for a company; you might make such a graph more interesting by adding links between people who participate in the same carpool, exercise class or some other group outside of work. What's the diameter of your graph? Learn about social networks, small-world phenomena, the ``six degrees of separation'' principle, and the work of Stanley Milgram in the 1960s. See The Small World Web by Lada Adamic [Adamic1999] or Six Degrees: The Science of a Connected Age by Duncan Watts [Watts2003] for some insight into how the World Wide Web is related to small-world phenomena. Try Google or CiteSeer to track down related work. You also can download and experiment with some DrScheme code I wrote for demonstrating simple small-world phenomena; when loaded, the code constructs a graph corresponding to a small-world network and then simulates how an idea, a rumor or a communicable disease might propagate through such a network.

The Perl pattern matching expression /(http:[\/\S.]*)/ works as follows: If the attempt to match against the current line (which is stored in the variable $_) is successful, then, following the match, the portion of the line that matches against the regular expression http:[\/\S.]* will be stored in the variable $1. The opening /( followed by the closing /) delimit the portion of the match that is stored in $1 (which happens to be the portion matching the entire regular expression in this case but that need not be so in general). The regular expression works as follows: The http: part is a simple string and requires an exact match against the current line. The square brackets [ and ] enclose a Perl character class; in this case consisting of the forward slash (\/), any non-whitespace character (\S), and the period (.). The asterisk (*) specifies that the match must match against some character in the character class zero or more times and it works ``greedily'' in that it tries to match as many times as it possibly can.

So the expression /(http:[\/\S.]*)/ will match http://cs.brown.edu/~tld/talk/home.html and it will also match http://www.cs.brown.edu/~tld/talk/home-Z-H-1.html#node_toc_start; unfortunately, the latter URL is not in traverse.dat nor is it in rejects.dat. URLs of the form http://base#name allow hyperlinks that jump directly to named sections of a document. Learn just enough about Perl regular expressions to alter the expression /(http:[\/\S.]*)/ so that it matches base URLs but excludes the part that references named sections.²⁶ Change build.pl accordingly and then use the modified build.pl and dot to construct and display a new graph which should look something like the following:

Code Fragments

Download all the code fragments in this chapter as a file in zip format.

²⁵ If you try to divert the output to /dev/null without setting the -term switch to something like vt100 and you're running in a shell with TERM = dumb, then lynx will prompt you for a different terminal type in which it's possible for lynx to format its output; however, in this case, lynx's request for a new terminal type will go in the bit bucket with everything else, you won't see it and lynx will just sit there waiting for your response.

²⁶ If you have the Perl documentation on your machine, typing info perlre will get you all the information you need to complete this exercise. Alternatively, you can obtain a copy of Programming Perl [Wall, Christiansen, and Orwant2000] or search the web for online Perl documentation.