Fishing Again
I mentioned on Thursday the task of learning to distinguish between arctic char and atlantic salmon. If you were learning to "sort fish" in a fish packing factory (cannery), you'd learn pretty quickly that some observable characteristics are shared by both kinds of fish, others are not terribly useful or may even be misleading in trying to make distinctions, and some are perfectly suited to distinguishing one type of fish from the other. After a while, you'd begin to ignore those characteristics that are misleading or useless for making distinctions and you'd become more refined in taking advantage of the useful diagnostic characteristics.
We seek out diagnostics for making all sorts of distinctions and we seldom waste time making distinctions (checking characteristics) that don't serve any useful purpose. In your fish-sorting job at the cannery, if you started distinguishing between arctic char with orange-red bellies and those with predominantly silver bellies by throwing the fish so distinguished into separate bins, then you'd probably get fired. It's just not necessary to make such distinctions if all you're interested in is cans of char and cans of salmon. In the ridiculously extreme case, a fish sorter run amok might consider each and every fish a separate species and create a separate label for cans containing each individual fish.
We want our spider to classify pages and links so as to assist the spider in navigating quickly to target pages, and we want it to ignore all other (irrelevant for our purposes) aspects of web pages. Recall our working assumption about the web, namely that fragments of the web surrounding one target page look similar to fragments of the web surrounding other target pages. If this isn't true, then we have no hope of generalizing from looking at the web graph around a few examples to the web graph as a whole. The operative notion of similarity in this case concerns the presence of navigational cues: specifically, in our running example, navigational cues pertinent to searching for pages pointed to by both philosophy and pop music fans.
Why should different fragments of the web have the same structure and available navigational cues? Because the human authors who produced these pages and linked them together think alike, have similar tastes, and use similar tools (see references to small-world phenomena in the July 17, 2002 entry). OK, possibly for some fragments but what about the rest of the web? Won't most of the web be different from those fragments influenced by pop music and philosophy? Yes, but other portions of the web will have their own replicated structures and cues influenced by their own communities of authors who share other interests, tools and techniques.
What if we group together or "cluster" pages identified with topics and communities? We distinguish among clusters other than those concerned with pop music and philosophy only insofar as such distinctions allow the spider to make better navigational choices in zeroing in on target pages. There may be clusters of pages which are devoid of useful navigational cues, in which case we'd throw all these clusters into one mega cluster; when the spider finds itself on such a page, the appropriate navigational advice would be simply to "get out" and to get into some other cluster where, hopefully, the spider will find more useful navigational cues.
What if we identify the states of our MDP with these clusters of navigationally similar pages? One immediate advantage is that there would be a much smaller set of states, unless the spider runs amok in making distinctions like our over-zealous fish sorter. Note also that because a state is really a set of pages, it's possible for the spider to perform an action and remain in the same state but on a different page. This is fine and indeed quite common. For instance, right this moment I might characterize my present state as being in my office at home sitting at my desk. A few seconds later, my posture has changed slightly, but for my present purposes my state has not changed: I'm still in my office at home sitting at my desk.
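To make the abstraction concrete, here is a minimal sketch of the idea that MDP states are clusters of pages, so that following a link (an action) can change the page while leaving the state unchanged. All of the class names, cluster labels, and page names below are invented for illustration; they aren't from the actual spider code.

```python
# Hypothetical sketch: states are clusters of pages, actions follow links.
# Moving between two pages in the same cluster changes the page but not
# the state, like shifting posture without leaving the office.

class ClusterMDP:
    def __init__(self, page_to_cluster, links):
        self.page_to_cluster = page_to_cluster  # page -> cluster label
        self.links = links                      # page -> list of linked pages

    def state(self, page):
        """The spider's state is the cluster its current page belongs to."""
        return self.page_to_cluster[page]

    def step(self, page, link_index):
        """Follow one outgoing link; return the new page and its state."""
        new_page = self.links[page][link_index]
        return new_page, self.state(new_page)

# Toy web fragment: a.html and b.html share a cluster, c.html does not.
mdp = ClusterMDP(
    page_to_cluster={"a.html": "philosophy", "b.html": "philosophy",
                     "c.html": "pop-music"},
    links={"a.html": ["b.html", "c.html"], "b.html": ["a.html"],
           "c.html": ["a.html"]},
)
page, state = mdp.step("a.html", 0)  # follow the first link from a.html
```

Following the first link out of `a.html` lands the spider on `b.html`, a new page but the same "philosophy" state.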
The model of the web graph with its billions of pages and links is as useless to a spider seeking specific target pages as a cosmologist's model of the universe based on some whacked-out hyper-dimensional string theory is to a real spider seeking a fly trapped in its sticky web. In learning, we make only those distinctions we need to, and in this way what we learn often generalizes to situations we haven't yet encountered. This strategy isn't foolproof. It may be that, on the basis of the few examples of each type of fish that you've seen so far, you decide to ignore some feature of arctic char, e.g., the way in which their hooked lower jaws protrude, that is crucial in distinguishing arctic char from atlantic salmon in a larger set of examples. This is possible but hopefully statistically unlikely if we choose a large enough set of examples.
So that's the basis for Joel's thesis: that we can build spiders that use a minimum of computational resources to learn to navigate the web in search of target pages. The learning algorithms that he uses learn to predict which pages will lead most quickly to target pages. These learning algorithms don't explicitly construct an MDP model with states and actions and rewards, but they implicitly group pages into clusters, each of which is associated with a different estimate of the "distance to the nearest target page". And Joel and I can explain the behavior of a spider guided by the predictions of such a learning algorithm in terms of the spider following a model, an MDP, whose states correspond to these clusters, whose actions correspond to following links, and whose rewards correspond to finding target pages.
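The navigation rule this implies can be sketched in a few lines: score each outgoing link by the learned estimate of the distance to the nearest target page, and follow the link with the smallest estimate. The `distance_estimate` function below is a stand-in for whatever the learning algorithm actually produces; the page names and numbers are invented.

```python
# Hypothetical sketch of greedy navigation by learned distance estimates.

def choose_link(current_page, links, distance_estimate):
    """Follow the outgoing link whose destination has the smallest
    predicted distance to the nearest target page."""
    candidates = links[current_page]
    return min(candidates, key=distance_estimate)

# Toy estimates: pages believed closer to a target get smaller numbers.
estimates = {"hub.html": 3.0, "fan-site.html": 1.0, "sports.html": 5.0}
links = {"start.html": ["hub.html", "fan-site.html", "sports.html"]}
best = choose_link("start.html", links, estimates.get)
```

With these toy numbers the spider would step to `fan-site.html`, the link with the lowest predicted distance.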
By the way, this implicit learned model still suffers from the problem of non-stationary rewards. There are various ways of fixing the formal model, but all of them end up multiplying the number of states to an extent that renders the model impractical. Instead we practice the simple expedient of having "explorer" spiders and "exploiter" spiders. An explorer spider operates best in states corresponding to clusters that don't contain target pages; its job is to seek out clusters that do. Explorers follow the predictions of the learning algorithm quite closely as they hunt for target-rich clusters. An exploiter spider operates best inside a target-rich cluster. Exploiters follow the dictates of the learning algorithm only to make sure that they remain within a target-rich cluster and otherwise search systematically within the cluster to ensure finding all of the resident target pages. (Researchers in reinforcement learning make a different distinction between exploration and exploitation, which you may encounter if you read papers in this area.)
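The division of labor between the two kinds of spider can be sketched as follows. This is a hedged illustration, not the thesis code: the explorer greedily follows the learned distance estimates until it reaches a target-rich cluster, while the exploiter sweeps that cluster systematically, never leaving it, and collects every target page inside. All graph data is invented.

```python
# Hypothetical sketch of explorer vs. exploiter behavior.

def explore(start, links, cluster_of, is_target_rich, estimate, max_steps=100):
    """Explorer: follow predicted-distance-minimizing links until
    reaching a target-rich cluster (or giving up)."""
    page = start
    for _ in range(max_steps):
        if is_target_rich(cluster_of[page]):
            return page
        page = min(links[page], key=estimate)
    return None

def exploit(entry, links, cluster_of, is_target):
    """Exploiter: systematically search the entry page's cluster,
    collecting all target pages without ever leaving the cluster."""
    home = cluster_of[entry]
    seen, frontier, targets = {entry}, [entry], []
    while frontier:
        page = frontier.pop()
        if is_target(page):
            targets.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen and cluster_of[nxt] == home:
                seen.add(nxt)
                frontier.append(nxt)
    return targets

# Toy web: page "a" sits outside the target-rich cluster {b, c, d},
# where "c" and "d" are the target pages.
cluster_of = {"a": "misc", "b": "rich", "c": "rich", "d": "rich"}
links = {"a": ["b"], "b": ["c", "a"], "c": ["d"], "d": []}
entry = explore("a", links, cluster_of, lambda c: c == "rich", {"b": 1.0}.get)
found = exploit(entry, links, cluster_of, lambda p: p in {"c", "d"})
```

In this toy run the explorer enters the rich cluster at "b" and the exploiter's sweep then turns up both targets, "c" and "d".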
After learning the requisite function for choosing between links to minimize the distance to the nearest target, we imagine starting one or more explorers from random starting points in the web and allowing these spiders to run until they find a target-rich cluster. Left to its own devices, an explorer, having locked onto a particular target page, would keep returning to the same page like an ant returning to the same tasty leaf oblivious to the fact that the forest floor is littered with similarly nutritious leaves. Once in a cluster, an explorer would first deploy (spawn off child processes) one or more exploiters and then teleport to some new starting point in the web. Explorers and exploiters would work as a team to maximize the number of target pages returned.
Earlier, when I mentioned that the learning algorithm groups pages into clusters, each of which is associated with a different estimate of the "distance to the nearest target page", I used scare quotes to set off the phrase "distance to the nearest target page". (Scare quotes are quotation marks placed around a word or phrase that the author feels might be inaccurate or misleading if interpreted literally. Often scare quotes are used to express skepticism, irony or sarcasm.) If you're standing on one side of a buggy, snake-infested, quicksand-riddled swamp and your house is on the other side of the swamp, you're likely to conclude that your distance home is not as the crow flies but rather the distance required to navigate safely around the swamp. In this case, the detour is advised by the danger of traversing the swamp.
Alternatively, imagine that there is a deep woods separating you from home and that you would surely lose your way if you were to wander into the woods. For a spider, the shortest distance home is the distance along the shortest path that the spider can successfully negotiate given its ability to extract navigational cues from its surroundings. So a spider considering moving forward into the woods versus moving to the right or the left around the woods would conclude that either of the latter moves is preferable to getting lost in the woods.
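This notion of navigable distance can be illustrated with a breadth-first search that refuses to enter "woods" pages, i.e., regions where the spider has no usable cues. The graph below is invented for illustration; the point is only that the effective distance along a navigable detour can exceed the raw graph distance.

```python
# Hypothetical sketch: shortest navigable distance avoiding "woods" pages.
from collections import deque

def navigable_distance(start, target, links, woods):
    """Breadth-first search for the shortest path that never enters a
    'woods' page; returns the hop count, or None if unreachable."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, dist = queue.popleft()
        for nxt in links.get(page, []):
            if nxt == target:
                return dist + 1
            if nxt not in seen and nxt not in woods:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Straight through the woods would take 2 hops; the safe route around
# the edge takes 3.
links = {"you": ["woods", "left"], "woods": ["home"],
         "left": ["edge"], "edge": ["home"]}
around = navigable_distance("you", "home", links, {"woods"})
through = navigable_distance("you", "home", links, set())
```

The detour costs one extra hop here, but it is the shortest path the spider can actually negotiate.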
Joel has developed a set of software libraries for building and deploying spiders and experimenting with various learning algorithms. In addition to exploring the wild web (the real web that you access through your browser), Joel can run his spiders in a canned web (the web in a box) that corresponds to several million pages of the real web, part of a snapshot of the real web at some particular point in time. The WT10g collection consisting of over 1.5M pages is one such instance of the web in a box. These so-called web archives have the advantage that they offer a stable environment in which experiments are repeatable, properties can be carefully measured and one researcher can attempt to replicate the results of other researchers.
The real web is a fascinating playground for computer scientists interested in implementing bots using techniques from artificial intelligence. For applications such as searching the web, you don't even need complicated and expensive hardware as you do in the case of more advanced robots and much of the software is freely available. The web is the most complicated human-fashioned technological artifact to ever exist and it's rapidly becoming tied to every aspect of our society. As it becomes more complicated, we will need surrogates to gather data, conduct business transactions, and, generally, represent us and do our bidding on the web. The next few decades promise to be very interesting (and potentially lucrative) for programmers who can build such surrogates.