Spiders in the Web


Jo picked me up at work yesterday so we could search for some plumbing supplies on the way home. Our house in Little Compton is out in the country and so we have a septic system for treating wastes and a well for our water. The water from our well is hard (it contains a lot of dissolved mineral salts) and acidic (it has a low pH). The acidity is particularly problematic as the acid in the water tends to eat through copper pipes; if you install new copper water pipes in a house and don't treat the water, within a couple of years you're likely to find pipes with little pin-hole-sized leaks or whole sections of pipe with paper-thin walls ready to burst at the least provocation.

The fellow from whom we bought our house installed an acid neutralizer and a water softener to avoid having to replace the pipes every couple of years. I guess it works because our pipes haven't burst; however, maintaining the equipment has been a pain. The latest problem is that the plastic pipe used to carry away the waste water from the acid neutralizer burst, flooding the basement. The first time it did this was right after a rather severe rain storm and we thought that the water in the basement was due to the storm. The water conditioning equipment is on a timer set so that it performs maintenance on itself, flushing salts and residues, late at night when no one is using the water. Indeed the maintenance program for the acid neutralizer is set to run every other night at around 2:00AM. But even though I knew this, I didn't suspect the acid neutralizer when we first found the water in the basement.

That first morning after the rainstorm, we mopped up the water and thought about what we might do to prevent similar flooding in the future, looking at the foundation near where the influx seemed to have originated and thinking about all sorts of solutions. The problem with finding leaks after the source has dried up is that water seeks the lowest point and so, depending on the topography of your basement floor, the location of the water left after the flood may bear little or no relationship to the location of the original source. And the rainstorm was such an obvious and convenient explanation.

Then two days later it happened again. This time the flooding was worse than before and there was no rainstorm to blame. Without a convenient explanation we had to look a little more carefully. On finding water on a ledge near the acid neutralizer we came up with a hypothesis, which we tested by manually cycling the neutralizer's maintenance routines. Sure enough, water gushed from a hole in the plastic pipe leading from the neutralizer to the drain. We patched the pipe but two days later we were back mopping up the basement and looking at several new cracks and holes in the plastic pipe. It seemed like the time to install a new pipe.

We went to our local country hardware store where the very helpful gentleman told us that the plastic pipe on our acid neutralizer no longer satisfied the building codes for exactly the reason that we had encountered: the acid in the waste water reacts with the plastic and renders it brittle and easily compromised. Unfortunately, he didn't have the sort of flexible plastic pipe that we needed though he was reasonably confident that we could find it at a plumbing supply store in Providence or Fall River. And thus our quest on Wednesday evening to find a source for acid resistant flexible pipe.

(It's early morning and I'm sitting on the deck using my laptop connected to my house LAN through a wireless connection. It's going to be in the 90s and humid in Providence but here at the beach it's cool and the sea breezes will probably keep us comfortable. I couldn't remember what pH was and so I fired off a search to Google that came up with, among many other interesting things, a pointer to a fascinating tutorial on acids, bases and the pH scale. I now know that pH measures the concentration of hydrogen ions in a solution. According to the theory of Arrhenius, an acid is a substance that ionizes in water to yield hydrogen ions (H+), and a base is a substance that ionizes in water to yield hydroxide ions (OH-). A little knowledge is a dangerous thing and I don't claim that the theory of Arrhenius is the last word on acids and bases. However, from what I could glean in my quick reading of several web pages, Svante Arrhenius (winner of the 1903 Nobel Prize in Chemistry for his electrolytic theory of dissociation) was a very interesting fellow and pursuing the history of the person and his theory could be an interesting diversion. I learned that pH is a measure of the concentration of hydrogen ions, notated [H+], in a solution containing water and the substance being tested. pH is defined as the negative logarithm, base 10, of [H+]. A neutral solution containing equal concentrations of H+ and OH- ions has a concentration of [H+] = 10^-7 M and hence a pH of 7 (concentrations are measured in "moles" per liter, M). Anything with a pH below 7 is classified as an acid and anything with a pH above 7 is classified as a base. I was the sort of kid who asked questions about everything and I still do; the web is often the first source I turn to for answers, though it is becoming increasingly difficult to find reliable, informed sources.
I am grateful to the Chemistry Department at the University of British Columbia (in one of my favorite cities, Vancouver, British Columbia, Canada) for making their course ware available to the general public.)
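(One more aside before I drop the chemistry: the definition fits in a couple of lines of code. This is just a sketch of the arithmetic; the well-water concentration below is a made-up illustrative number.)

```python
import math

def pH(h_ion_concentration):
    """pH is the negative base-10 logarithm of the hydrogen-ion
    concentration [H+], in moles per liter."""
    return -math.log10(h_ion_concentration)

# A neutral solution: [H+] = 10^-7 M, so pH = 7.
print(round(pH(1e-7), 3))   # 7.0

# Mildly acidic well water (hypothetical): [H+] = 10^-6 M, so pH = 6.
# pH below 7 means acidic; above 7 means basic.
print(round(pH(1e-6), 3))   # 6.0
```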

We first went to a big plumbing supply store on the outskirts of Providence. We had a sample of the plastic tubing that had deteriorated and thanks to the country hardware store sales person we had some idea of what to ask for. The man at the counter of the plumbing supply knew what we needed but once he determined that he didn't have it in stock he basically turned off and ignored us - no sale, no need to continue the conversation. He wasn't helpful in suggesting places where we might find what we needed and he had no interest in thinking up alternative ways of solving our problem. The fellow in the country was full of ideas for solving our problem, some of which involved purchases from his store, but all of them helpful. After the plumbing supply store, we had a similarly disheartening experience at another store before giving up.

On the drive home, Jo and I remarked on how some people just love to help out, offer their experience and expertise and try to solve other people's problems; they just seem wired this way. It certainly doesn't have anything to do with whether you live in the country or in the city. I don't know what sort of upbringing or life experiences encourage such an outlook but I do feel that computer science as a particular discipline, and academia more as a mindset than as a particular environment, is conducive to attracting and producing such people. I think this is especially true right now when computers and computational ideas more generally are just beginning to realize their potential. The field attracts problem solvers, people trying to explain phenomena that have previously eluded satisfactory explanation and people who like to share ideas because sharing and collaborating is key to building software.

Teaching and learning are everywhere and always interchangeable. In a field in which the subject matter and problems of interest bleed into every conceivable discipline, the lines between teachers and students are constantly blurred. Respecting the experience of others and learning to explain without condescending is essential to working together. It's a great field for someone like me who loves to ask questions and finds satisfaction in learning useful information so I can answer the questions posed by others.

I don't know where I was going with that last bit. Chalk up the high-minded, soap box verbiage to low blood sugar which I hope has now been taken care of by a little breakfast. What I really wanted to write about today concerns the use of explanatory models and the research that one of my graduate students, Joel Young, is working on. It's not surprising that this is on my mind as Joel and I have been preparing for his thesis proposal for some weeks now and tomorrow, Friday, is the big day when he makes his presentation to the faculty.

Joel's thesis concerns whether or not it is possible to build a piece of software called a spider or bot (a disembodied robot, sometimes referred to as a knowledge robot or knowbot - a term coined by Robert Lucky while at Bell Labs and described in his book "Silicon Dreams" - or just bot) that can search the web for pages that satisfy a particular query. There is already software that does this, e.g., Google and AltaVista, but Joel has in mind a very different sort of software. Google and AltaVista are so-called big-iron search engines (the word "big-iron" used to be reserved for large mainframe computers, but for some applications these traditional "big-iron" computers are being replaced by "farms" of hundreds of personal computers designed to work together). Google has thousands of computers working to search through billions of web pages; Google finds stuff by exhaustive search, first collecting and then sifting through the entire contents (or at least a large portion) of the web.

Joel is interested in bots that are able to find stuff (answer queries) by searching through only a very small portion of the web. They operate not by exhaustively searching the web but rather by learning to sniff out promising paths to follow and read navigational cues embedded in the links and nodes of the web graph (see the July 17, 2002 entry concerning the web graph). Unlike real spiders they don't spin webs of their own or lay traps for unsuspecting insects. But Joel's spiders are rather like real spiders in that they are light and nimble and can scurry across the links of the web graph using a minimum of computational resources.

Whether or not Joel can build spiders of the sort he's set out to build will depend on properties of the web graph. In particular, Joel is assuming that a spider can learn to follow promising paths and read useful navigational cues by looking at only a small portion of the web called a training set. Here's how one of Joel's bots might work. Suppose you want to find web pages that have links to both twentieth century philosophers and pop musicians and suppose you have some examples of pages that you want, e.g., pages that point to both Destiny's Child and Soren Kierkegaard, and pages that you'd just as soon not see again, e.g., pages that point to Marilyn Manson and Friedrich Nietzsche.

You'd give these positive and negative examples to the spider and tell it to find more positive examples. First it would search the web around the positive and negative examples like a bloodhound learning to follow a particular scent. This is trickier than it might seem at first blush and indeed the "big-iron" search engines come in very handy at this stage. Google lets you find pages that link to a given page (click on the "Advanced Search" link on the Google home page and scroll down to "Page-Specific Search" - it's a little eerie asking for pages that point to your own home page). Google can do this because it has explored a large portion of the web and so for almost any page you can think of it has probably visited some if not all of the pages that link to that page.

Once a spider has explored the web around the example pages it uses the information it has gleaned to formulate the idea of a target page (a generalization which attempts to include the positive examples while excluding the negative ones) and, more importantly from the standpoint of Joel's research, a strategy for looking at a set of pages and deciding which one to explore next. Armed with this knowledge which it has learned from the examples, the spider sets off looking for additional target pages.

At any given point in time, the spider will have a set of pages that it has visited (and hasn't thrown away as yet - remember that the spider has only a limited amount of memory so eventually it has to "forget" pages by throwing away whichever ones it deems least useful) and a set of pages that it has found links to but hasn't yet explored (here again the spider will have to work within its memory limitations). This latter set is called the "fringe" and it represents the limits or boundaries of the spider's knowledge. The spider uses what it learned from exploring the web around the positive and negative examples to choose a page from the fringe to explore, one that it thinks is likely to lead quickly to a target page.
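The search loop just described amounts to a best-first search with bounded memory. Here's a minimal sketch of how it could be coded - and it is only a sketch: score_page, get_links and is_target are hypothetical stand-ins for the learned ranking function, the page fetcher and the target test, and the toy graph at the end stands in for the real web.

```python
import heapq

def focused_crawl(seeds, score_page, get_links, is_target,
                  max_fringe=1000, max_visits=100):
    """Best-first search over the web graph: repeatedly expand the
    most promising fringe page, within fixed memory limits."""
    # The fringe is a heap of (negated score, url) pairs, so the
    # highest-scoring page pops first.
    fringe = [(-score_page(u), u) for u in seeds]
    heapq.heapify(fringe)
    visited, targets = set(), []
    while fringe and len(visited) < max_visits:
        _, url = heapq.heappop(fringe)      # most promising page
        if url in visited:
            continue
        visited.add(url)
        if is_target(url):
            targets.append(url)
        for link in get_links(url):
            if link not in visited:
                heapq.heappush(fringe, (-score_page(link), link))
        # "Forget" the least promising links to respect the memory limit.
        if len(fringe) > max_fringe:
            fringe = heapq.nsmallest(max_fringe, fringe)
            heapq.heapify(fringe)
    return targets

# A toy web graph standing in for real page fetches.
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
found = focused_crawl(["a"],
                      score_page=lambda u: 1.0 if u == "b" else 0.0,
                      get_links=lambda u: graph[u],
                      is_target=lambda u: u == "d")
print(found)   # ['d']
```

Note that the memory limit is exactly where the spider differs from a big-iron engine: when the fringe overflows, the lowest-scoring links are simply discarded, so a bad scoring function can forget the very link it needed.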

Exactly how the spider learns to choose a page from the fringe to explore next is the subject matter of a field called machine learning, which concerns computer scientists, statisticians, and researchers in a host of other disciplines. Suffice it to say here that there exist algorithms that can learn very useful aptitudes when provided with an appropriate set of training examples. These algorithms require not only a set of examples but typically also a language or set of features for characterizing the examples and expressing whatever preferences or actions are appropriate for the sought-after aptitude. For example, if you were learning to distinguish Arctic char from Atlantic salmon you'd have to add terms to your vocabulary to describe variations in fish scales, fin type and placement, and iridescent coloring.

In the case of learning to choose a page to explore, the learning algorithm might simply take two pages and indicate which one is likely to lead more quickly to a target page. Alternatively it might assign each page a numeric value, with higher values indicating pages that are more likely to lead quickly to a target page. Explaining some of this is difficult without the language of statistics but the basic principles are pretty straightforward. The language for characterizing pages might include features referring to the presence of particular words (is the word "philosopher" present in the text of the page?) or attributes of the page that can be gleaned from the URL or the surrounding pages (does the domain name for the server indicate a company, an academic institution or perhaps an organization with a particular political agenda?).
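To make the second alternative concrete, here's a sketch of a feature language and a numeric scoring rule of the kind just described. The particular features and weights are invented for illustration; a real system would learn the weights from the training examples.

```python
from urllib.parse import urlparse

def page_features(url, text):
    """Map a page to a small feature vector (feature name -> 0/1).
    These features are made up for illustration."""
    domain = urlparse(url).netloc
    return {
        "has_philosopher": int("philosopher" in text.lower()),
        "has_music": int("music" in text.lower()),
        "is_academic": int(domain.endswith(".edu")),
    }

def score(features, weights):
    """A linear scoring rule: higher scores mean the page is judged
    more likely to lead quickly to a target."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())

# Hypothetical learned weights.
w = {"has_philosopher": 2.0, "has_music": 2.0, "is_academic": 0.5}
f = page_features("http://example.edu/page", "A philosopher on pop music")
print(score(f, w))   # 4.5
```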

Coming up with a good language to support a particular sort of machine learning is generally pretty hard, but once you have such a language, learning can, at least in some cases, be relatively straightforward using modern machine learning tools. I forgot to mention that part of Joel's thesis is concerned with whether or not it is even possible to build spiders of the sort we envision. If it happens to be the case that every subgraph containing pages that are pointed to by both philosophy and popular music pages looks nothing like every other subgraph containing such pages - there are no discernible navigational cues that generalize across these subgraphs - then we're hosed (slacker/hacker jargon for "we're in an extremely unfortunate situation"). But Joel and I think it will be possible to make such generalizations to support directed search (in contrast with aimless link following) for a wide range of interesting targets.

One of the things that Joel and I have been thinking about in preparation for his proposal presentation concerns how best to characterize the "decision problem" faced by the spider in choosing the page to explore next. I'd like to try to explain our thinking to you, though I realize that the subject matter is perilously close to what it is I do and hence there is a danger that I will lapse into jargon or head off on tangents even more tangential and ridiculously peripheral than some I've investigated in earlier entries. (What does "pH" have to do with computer science anyway?) But I'll risk it! (What do I have to lose?) Even if I delete this entry, I'll still be better prepared for Joel's presentation tomorrow. (The student's advisor, namely me, is just as much under scrutiny as the student, Joel, in these academic presentations; after all, I was the idiot who suggested the topic and encouraged Joel to work on it.)

There are static decision problems in which you have a single choice to make that does not depend on subsequent choices, and then there are dynamic decision problems in which you expect to make a series of decisions in which earlier decisions will affect those that follow. The decision problem faced by the spider is a dynamic one; if the spider decides to explore one page in the fringe, it will end up with different alternatives to explore than if it had explored some other page. A particularly ill-advised sequence of decisions could transport the spider into a veritable web wasteland, devoid of pages on philosophers, popular music icons, or anything else of conceivable use in searching for the target pages.

Mathematicians and people interested in automated planning and control have devised various models characterizing the predicament that planners or decision makers or, even more generically, agents find themselves in. One of those models is the dreaded Markov decision process or MDP. (The "dreaded" part is a tongue-in-cheek allusion to the view held by some of my colleagues that I see absolutely everything as a Markov decision process. It's not true, but there have been times during my research career in which almost everything I published was concerned with MDPs.)

In an MDP, the agent can be in one of several "states". There could be ten, one hundred, one million or even an uncountably infinite number of states, but we usually assume that there is a finite, though possibly large, number of them. When in a state, the agent can choose to perform one of several actions, where again we generally assume that there is a finite but, in the case of actions, relatively small number of them. Finally, we assume that the agent receives some sort of reward or punishment when it arrives in a state. In the case of a spider searching the web for target pages, the spider gets a reward for finding a target page.
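A toy illustration of those ingredients - states, actions, transitions and rewards - in the spider's terms. Everything here is made up for the sake of the example, and the transitions are deterministic for simplicity (a real MDP allows probabilistic transitions).

```python
# A toy MDP: a handful of states, two actions, a transition function,
# and a reward received on arriving in a state.
states = ["hub", "wasteland", "target"]
actions = ["follow_good_link", "follow_bad_link"]

# transition[(state, action)] -> next state (deterministic here).
transition = {
    ("hub", "follow_good_link"): "target",
    ("hub", "follow_bad_link"): "wasteland",
    ("wasteland", "follow_good_link"): "hub",
    ("wasteland", "follow_bad_link"): "wasteland",
    ("target", "follow_good_link"): "hub",
    ("target", "follow_bad_link"): "wasteland",
}
reward = {"hub": 0.0, "wasteland": -1.0, "target": 10.0}

def best_action(state):
    """Greedy one-step choice: pick the action whose successor state
    yields the highest immediate reward. (A real MDP solver would
    look further ahead than one step.)"""
    return max(actions, key=lambda a: reward[transition[(state, a)]])

print(best_action("hub"))   # follow_good_link
```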

A model is an "eye-of-the-beholder" sort of a thing. A good model emphasizes those aspects of a problem that the modeler thinks important and deemphasizes, or abstracts away from, those details considered by the modeler to be irrelevant to the problem. Models are used to predict behavior, as in the case of a model of the solar system that predicts the orbits of the planets, or explain behavior, as in the case of my, admittedly weak, model for how acid degrades plastic pipe thus explaining how my basement was flooded. Models can also be used to diagnose faults in computer hardware and software or identify diseases in biological organisms.

A decision model can be used to plan out a sequence of actions or it can be used as a means of gaining insight into why a particular decision problem is easy or hard. It's this last application of models that I'm primarily concerned with here. To completely specify a decision model for the spider problem, we have to say what the states and actions are; in some sense the choice is arbitrary since it's my model but the choice will have consequences. For example, suppose that we say that a state corresponds to the spider being at a particular web page in the sense that you are at a particular web page when your browser is displaying that page. And suppose an action corresponds to traversing a link leading out of the page corresponding to the current state.

OK, what are the consequences of this model? Well, it's fine except that if we assume a spider will have to learn this model in order to find target pages, then our spider will have to have the computational resources of a Google or AltaVista, and we've said our spiders are computational light-weights. There's another problem that at first seems a mathematical detail but later turns out to have more profound consequences. We imagine our spider getting a reward for finding a target page the first time, but if our spider keeps returning with the same page, like a dog that keeps returning a stinky dead animal carcass that you keep trying to throw in the bushes, then you're not going to be very happy.

How do we stipulate that a spider only gets a reward the first time it returns a target? Well, it's easy to stipulate but any way you stipulate it, the resulting model violates the formal mathematical model for an MDP. "Who cares?" you say, "Let the mathematics be damned; I just want to represent the problem the way it really is." The problem is that while the formal model carries along some baggage, in the negative sense of constraints that make it difficult to apply the model to real stuff, it also carries along some valuable freight, in the sense of powerful mathematical tools for analysis which only apply if you satisfy the constraints. In a Markov model, the states, actions and rewards can't change. If you have a formulation in which they do change, then you end up with a "non-stationary" model which, as its name might suggest, is tough to "hit" with existing mathematical tools.

In response to this problem with our simple model, some researchers have formulated an alternative model which is "stationary", satisfies all the formal requirements for an MDP, but has some problems of its own. In this model (due to Andrew McCallum and Jason Rennie) the states describe what the spider has and has not yet visited; at any given time the spider has visited a set of pages and has all of the remaining pages in the web left to explore. A state, S = (V, U), is simply a division, formally called a partition, of the nodes in the web graph into those that have been visited, V, and those that haven't, U.

This is a perfectly coherent account of states in an MDP sense. An action consists of choosing a link out of some visited page to some as-yet-unexplored page. In this model, it's possible for the spider to transition from one state, S1 = (V1,U1), to another, S2 = (V2,U2), where Vn is the set of visited pages for Sn and Un is the set of unexplored pages for Sn, if V2 is equal to V1 plus exactly one page in U1, call it P, and U2 is just U1 minus P. In addition, the spider gets a reward for visiting S2 just in case P is a target page. This model satisfies the formal requirements for an MDP; unfortunately, it's not terribly satisfying in other respects.
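A single transition in this partition-style model can be sketched in a few lines. This is only an illustration of the bookkeeping, with single-letter page names standing in for real URLs:

```python
def explore(state, page, targets):
    """One transition in a partition-style model: a state is a pair
    (V, U) of visited and unexplored pages; exploring page P in U
    yields a new state with P moved from U to V, plus a reward
    exactly when P is a target page."""
    V, U = state
    assert page in U, "can only explore an as-yet-unexplored page"
    new_state = (V | {page}, U - {page})
    reward = 1.0 if page in targets else 0.0
    return new_state, reward

s = ({"a"}, {"b", "c", "d"})           # visited {a}, rest unexplored
s, r = explore(s, "b", targets={"b"})
print(sorted(s[0]), sorted(s[1]), r)   # ['a', 'b'] ['c', 'd'] 1.0
```

Since V only ever grows, no state can recur, which is exactly how the model guarantees that the reward for a target page is collected at most once.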

It satisfies the property that rewards can't change conveniently enough: it guarantees that the spider only gets a reward when it visits a target page for the first time by making it impossible for the spider to visit the same state twice. It is an MDP, however, and McCallum and Rennie were able to use the model to explain their learning algorithm and justify the use of some relatively sophisticated mathematical machinery from a subfield of machine learning concerned with "reinforcement learning". Reinforcement learning is learning in which the learner is given feedback in the form of rewards and punishments when it attempts to answer a question or solve a problem; this is in contrast with learning in which the learner is provided with the answer to the question or the solution to the problem. Joel and I weren't particularly happy with McCallum and Rennie's model from an explanatory standpoint and have been working on alternative formulations of the learning problem. I'll try to articulate our model tomorrow on the bus.