Darwin's Dangerous Algorithm
Albert, Aron, Eliot, Erika, Jennifer, Leah, Luke and Susannah were important in inspiring me to work on this journal. Indeed, my conversations with students over the years have been the highlight of my academic career. I'm indebted to hundreds of students who have challenged me, made me think differently about things I thought I understood, and, in diverse and often dramatic fashion, demonstrated to me the value of different approaches to learning and teaching. Back in July, I exchanged email with a former student, now a graduate student at the University of Washington. That exchange got me thinking back to a learning experience that she and five of her fellow students became involved in when they were sophomores at Brown, an experience that taught me a lot about working with students as they direct their own educations.
In the Fall of 1996, six students, Ana-Maria, Andrew (or Shoe as he liked to be called), Brandon, Brian, Gideon and Sam, came to me asking if I'd serve as a faculty sponsor for a GISP (Group Independent Study Project) in the following spring. The proposed subject matter of the GISP was all over the map; they wanted to pursue topics and connections among topics including evolution, natural selection, computational complexity, and combinatorial optimization. In particular, they wanted to explore the field of genetic algorithms and genetic programming. I was skeptical of both the material and the value of spending an entire semester studying it. From my perspective, genetic algorithms and the related genetic programming are simply methods for searching in very large hypothesis spaces and not terribly effective methods at that. I expect these terms, hypothesis space, hypotheses, search, are unfamiliar; so let's back up a bit and proceed at a more leisurely pace.
Depending on the context, the term hypothesis space might refer to a set of possible answers to a posed question, a set of possible explanations for a given phenomenon, or a set of possible solutions to a specified problem. The answers, explanations and solutions are referred to generically as hypotheses. In many computational problems, you proceed by looking at or searching through a sequence of such hypotheses. If you're lucky, the hypothesis space has some inherent structure or regularity, so that having examined one hypothesis and found it wanting, you can extract clues about how to transform the current hypothesis into one or more alternative hypotheses that might, just might, correspond to better answers, explanations or solutions. For example, in the traveling salesman problem (see the September 2nd entry for a description of the traveling salesman problem and other hard computational problems), a hypothesis is a tour that visits all of the cities in which the salesman has customers. It may be that by looking at a given tour you can identify features, e.g., portions of the tour that have the salesman ineffectively bouncing back and forth across the country, that suggest how to create alternative, shorter tours. Since the traveling salesman problem is in the class of NP-complete problems (the decision-problem version of the problem anyway), these features and accompanying hints don't always lead to shorter tours, but methods based on them often perform well on problems occurring in practice.
Genetic algorithms constitute a general method (framework, really, as I'll explain shortly) for searching in very large hypothesis spaces. The "very large" qualifier is added because if the space is small enough, then we can "search" it exhaustively by simply enumerating every hypothesis and picking out the best one. Such general search methods can be used for solving so-called combinatorial optimization problems (problems in which the component pieces of solutions, e.g., the individual city-to-city hops that comprise a tour, can be assembled or combined in a large number of ways), learning problems like the spam classification problem that we looked at on August 30th, and automatic programming problems like synthesizing a robot program to guide a robot through a maze (wouldn't it be nice if you could get a computer to write all your programs for you). Indeed, genetic algorithms are sometimes scathingly referred to as the second-best method for solving just about any problem.
So I was skeptical but I got caught up in their enthusiasm. Genetic algorithms (GAs) are interesting not only for what they can do, e.g., solve hard combinatorial optimization problems, but also for how they do it. GAs simulate in a rough way the process of evolution and natural selection. GAs operate on whole collections or populations of hypotheses encoded as strings of digital DNA (data structures that programmers often think of in terms of genes, codons, and other terms familiar to molecular biologists). Hypotheses are subjected to simulated mutation and the "fittest" hypotheses are allowed to "reproduce" resulting in offspring that combine features of the best hypotheses. Once you get into the spirit, you can borrow just about anything you like from evolutionary biology. You could even incorporate ideas from currently unpopular theories such as those of Jean-Baptiste Lamarck (1744-1829) who proposed that evolution is directly influenced by the experiences of individual organisms over their lifetimes. You're writing the simulation so you have a great deal of freedom. (It's a bit like playing god which might explain some of the attraction of the field.) More often than not, however, programmers tend to follow nature's suggestions. It seems plausible that nature stumbled upon some pretty efficient ways of searching in very large hypothesis spaces, otherwise how do you account for us; we certainly think we're pretty cool.
They had me reading everything: Richard Dawkins's "The Selfish Gene" [Oxford University Press, 1989], James Watson's "The Double Helix: A Personal Account of the Discovery of the Structure of DNA" [W. W. Norton and Company, 1980], Robert Axelrod's "The Evolution of Cooperation" [Basic Books, 1984], John Maynard Smith's "Evolution and the Theory of Games" [Cambridge University Press, 1982] and "Evolutionary Genetics" [Oxford University Press, 1989], just to name a few of the books that ended up on my night-time reading table. Genetic, molecular and evolutionary biology provide a wealth of theory to serve as the basis for heuristics to assist in guiding search. We were looking at anything that might make our algorithms better or perhaps supply some missing piece thereby enabling us to tap the power and mystery of natural selection. Why settle for some ho-hum low-fidelity version of simulated evolution when you can have a super-charged high-fidelity version with all the bells and whistles?
In addition to all the readings in biology, we also read Melanie Mitchell's "An Introduction to Genetic Algorithms" [MIT Press, 1996], which is a very readable introduction to the field. I had met Melanie two years earlier during a sabbatical stay at the Santa Fe Institute and she managed to dispel some of my skepticism. During my stay in Santa Fe, I gained grudging respect for what a good programmer could do with GAs. It's important to point out, however, that GAs aren't a ready-made solution for any particular problem; they offer a general framework for solving problems but there are lots of pieces that a programmer has to supply. Let's consider how we'd go about trying to solve an instance of the traveling salesman problem using GAs.
The first thing you need to think about is how you're going to
represent a hypothesis, a tour in the case of the traveling salesman
problem (TSP). I don't have time for an introduction to genetics so
we'll dispense with the biological terminology. For our purposes, a
string of digital DNA corresponds to a list of simple tokens. For the
tokens in our TSP instance, we'll use the standard three-letter codes
for the airports in six cities: BOS for Boston,
BWI for Baltimore, DFW for Dallas Fort
Worth, ORD for Chicago, PVD for Providence
and SFO for San Francisco. A tour then is a list of six
tokens, with the assumption that the last leg of the tour involves a flight
from the city corresponding to the last token in the list back to the
city corresponding to the first token in the list. Here's an example
tour, (BOS PVD BWI SFO ORD DFW), Boston to Providence to
Baltimore to San Francisco to Chicago to Dallas Fort Worth and,
finally, back to Boston. We'll also need a matrix representing the
distance between each pair of cities.
      BOS   BWI   DFW   ORD   PVD   SFO
BOS  0000  0371  1550  0867  0043  2704
BWI  0228  0000  1100  0621  0328  2400
DFW  1550  1100  0000  0802  1503  1464
ORD  0867  0621  0802  0000  0849  1846
PVD  0043  0328  1503  0849  0000  2654
SFO  2704  2400  1464  1846  2654  0000
Using the above distance matrix, we can calculate that our example
tour has a length of 6969. The first part of the tour makes a certain
amount of sense; we travel down the eastern seaboard from Boston
(BOS) to Providence (PVD) to Baltimore
(BWI). But then we shoot over to San Francisco
(SFO), back across the country to Chicago
(ORD), and down to Dallas Fort Worth (DFW)
before flying back to Boston. If we were to reverse the order of
tokens corresponding to Chicago and San Francisco in the original
tour, we'd get the following tour, (BOS PVD BWI ORD SFO
DFW), of length 5852, a significant improvement. You might
imagine two tokens swapping positions as being a mutation of
sorts. And you might implement this form of mutation using a
subroutine for generating random numbers. Using such a subroutine, it
is very easy to write a program that examines a list representing a
tour and, with a given fixed probability, for each consecutive pair of
tokens in the sequence, either reverses the order of the two tokens or
leaves them as they were. Such a program is called a genetic operator in the
parlance of genetic algorithms. Is this form of mutation biologically
plausible? Perhaps but it doesn't really matter; we're calling the
shots and we're free to slavishly emulate natural selection to the
extent that we understand it or depart from reality as we see fit.
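To make this concrete, here is one way the representation might be coded in Scheme. The association-list layout of the distance matrix and the names distances and distance are my own assumptions for illustration; tour-length is the procedure that the genetic-algorithm code later in this entry relies on.

```scheme
;; A tour is a list of city symbols, e.g., (BOS PVD BWI SFO ORD DFW).
;; The distance matrix is stored as a nested association list; the
;; diagonal entries are omitted since a tour never travels from a
;; city to itself. This layout is an assumption for illustration.
(define distances
  '((BOS (BWI . 371) (DFW . 1550) (ORD . 867) (PVD . 43) (SFO . 2704))
    (BWI (BOS . 228) (DFW . 1100) (ORD . 621) (PVD . 328) (SFO . 2400))
    (DFW (BOS . 1550) (BWI . 1100) (ORD . 802) (PVD . 1503) (SFO . 1464))
    (ORD (BOS . 867) (BWI . 621) (DFW . 802) (PVD . 849) (SFO . 1846))
    (PVD (BOS . 43) (BWI . 328) (DFW . 1503) (ORD . 849) (SFO . 2654))
    (SFO (BOS . 2704) (BWI . 2400) (DFW . 1464) (ORD . 1846) (PVD . 2654))))

;; Look up the distance for a single leg of a tour.
(define (distance from to)
  (cdr (assq to (cdr (assq from distances)))))

;; Sum the legs of a tour, including the final leg back to the start.
(define (tour-length tour)
  (let loop ((cities tour) (total 0))
    (if (null? (cdr cities))
        (+ total (distance (car cities) (car tour)))
        (loop (cdr cities)
              (+ total (distance (car cities) (cadr cities)))))))
```

With these definitions, (tour-length '(BOS PVD BWI SFO ORD DFW)) evaluates to 6969, the length computed above.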
Other forms of mutation might not work out so well for our purposes. If our mutation routines were free to substitute any one token for any other token with some probability, that might result in a tour visiting the same city twice or, worse, inserting a city that we didn't have to visit in the first place. Luckily we can tame natural selection to avoid generating genetic misfits, or, in our case, tours that lead us in circles or to destinations we'd rather not visit.
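A minimal sketch of the well-behaved pairwise-swap mutation described above; swap-mutate is a name I've invented, and I assume an implementation-supplied (random n) that returns an integer in [0, n), as in Guile and several other Schemes.

```scheme
;; Walk down the tour; with probability rate, swap each consecutive
;; pair of tokens. Both the name swap-mutate and the use of (random n)
;; are assumptions for this sketch.
(define (swap-mutate tour rate)
  (let loop ((cities tour) (result '()))
    (cond ((null? cities) (reverse result))
          ((null? (cdr cities)) (reverse (cons (car cities) result)))
          ((< (random 1000) (* rate 1000))
           ;; swap this pair and continue past both tokens
           (loop (cddr cities)
                 (cons (car cities) (cons (cadr cities) result))))
          (else (loop (cdr cities) (cons (car cities) result))))))
```

Calling (swap-mutate tour 0.05) leaves most pairs alone, while a rate of 1 swaps every consecutive pair.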
We can also arrange for programs that simulate sex and thereby
enable hypotheses represented as strings of digital DNA to share their
genetic heritage. Our current winner, (BOS PVD BWI ORD SFO
DFW) (length 5852) has a good east coast solution and a so-so
finish. But consider, (BOS BWI PVD DFW SFO ORD) (length
6379); it wanders up and down the east coast but takes a nice swing
down through the south (DFW) over to the west coast
(SFO) and then back across the country stopping in
Chicago (ORD) before returning to Boston
(BOS). Suppose these two were to get together and swap
their best attributes. We break (BOS PVD BWI ORD SFO
DFW) into two equal parts, (BOS PVD BWI) and
(ORD SFO DFW), and the same for (BOS BWI PVD DFW
SFO ORD) resulting in (BOS BWI PVD) and (DFW
SFO ORD). Now we swap parts forming two new sequences:
(BOS BWI PVD ORD SFO DFW) of length 6408 which ain't so
great and a second, golden child, (BOS PVD BWI DFW SFO
ORD) of length 5648, the best we've seen so far. Does this
have any analog in natural selection? Yes, and in the terminology of
genetic algorithms it's called a crossover genetic operator [1].
Unlike the simple mutation operator we discussed earlier, crossover
operators require two strings of digital DNA to tango. So, how do
we simulate sex and reproduction in a genetic algorithm?
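The half-and-half recombination just illustrated can be sketched as follows; crossover is my own name for it, and, as the footnote at the end of this entry points out, splitting at the midpoint only yields legal tours when the two parents happen to have the same cities in each half.

```scheme
;; Split each parent at the midpoint and exchange second halves.
;; This only produces legal tours when both halves of the two parents
;; contain the same sets of cities; a general-purpose crossover
;; operator must repair or avoid duplicated cities.
(define (crossover mom dad)
  (let* ((half (quotient (length mom) 2))
         (split (lambda (tour)
                  (let loop ((front '()) (back tour) (n half))
                    (if (= n 0)
                        (cons (reverse front) back)
                        (loop (cons (car back) front) (cdr back) (- n 1))))))
         (m (split mom))
         (d (split dad)))
    (list (append (car m) (cdr d))
          (append (car d) (cdr m)))))
```

Applied to our two parents, it returns exactly the two children from the example: the golden child (BOS PVD BWI DFW SFO ORD) and the not-so-great (BOS BWI PVD ORD SFO DFW).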
First of all, note that we're quite clear about what it means for tours, corresponding to strings of digital DNA, to be improving; shorter tours are more desirable and we're searching for the shortest possible tour. In natural selection, individuals are not necessarily getting "better" even on average; natural selection simply selects for genes (which are basically programs for building individuals or pieces of individuals) that manage to do a good job at reproducing themselves. If such genes give rise to individuals that are stronger or smarter, then that's interesting but it needn't be the case. It's certainly possible that natural selection might stumble on a program for building small, fast, mindless individuals that are superior to us in the sense that they reproduce so effectively that they wipe us off the planet. But, as the designers of genetic algorithms, we are free to play fast and loose with what we know about natural selection. In particular, we get to define the criterion that governs who gets to reproduce and how much.
Genetic algorithms operate by analyzing a set of hypotheses referred to as a population, a set of tours in the case of TSP. At any point in its operation, a GA will have a current population that it analyzes in order to determine which entities in the population will get an opportunity to reproduce. Using the results of its analysis, the GA then constructs the next generation by mutating and pairing selected entities from the current generation. The generation so produced becomes the current generation and the GA repeats this process. If an entity is found in a generation that meets some specified termination criterion, the GA will halt and return the entity, otherwise the GA continues until it has examined a specified number of generations. Exactly how entities are selected for reproduction is at the heart of designing a GA.
It shouldn't be hard to convince yourself that not all mutations and crossovers will result in improvements; indeed, in our crossover example only one of the two new tours was an improvement on its parents. We probably don't want to give all entities an equal opportunity to reproduce nor can we expect that all of the entities resulting from reproduction will be equally desirable for future reproduction. We use the term fitness to refer to our measure of how desirable or fit a string of digital DNA is for reproduction. The higher the fitness of a string the more likely it will get selected for reproduction. Since shorter tours are better than longer ones we want to choose a fitness function that's inversely related to tour length. Since we're interested in conferring a reproductive advantage within a given population we can focus on the entities within that population but, as I'll explain shortly, this can result in shortsightedness. Let's define the fitness of a tour to be the difference between the length of the tour and the length of the longest (worst) tour in the population. With this definition the longest tour is the least fit, with a fitness of zero, and the shortest tour is the most fit. How can we use this measure of fitness to select tours for reproduction?
Randomness seems to play a role in natural selection. Occasionally
fit entities fail to reproduce and not-so-fit entities produce more offspring
than one would expect. Randomness helps nature to avoid heading down
a reproductive dead end. In GAs, the process of selection is carried
out like a lottery in which fit entities get more lottery tickets than
less fit entities but even a not-so-fit entity can win the jackpot and
reproduce a whole gaggle of offspring. Typically the probability that
an entity will be allowed to reproduce is proportional to its fitness.
In Scheme, we might calculate the fitness and reproductive probability
of the entities in a population as follows (see the July 15th entry for a
discussion of functional programming and the map operator
in particular).
(define (distribution population)
  (let* ((worst (apply max (map tour-length population)))
         (fitness (lambda (tour) (- worst (tour-length tour))))
         (total (apply + (map fitness population)))
         (probability (lambda (tour) (/ (fitness tour) total))))
    (map probability population)))
This is a simplified version of what's done in practice. In practice,
a distinction is typically made between individuals (referred to as
phenotypes) and classes of individuals all of whom share the
same genetic code (or genotype). For example, in a population
of TSP tours, it could be the case that the same tour appears multiple
times. In designing a GA, you have to think about not only the
fitness of a particular code but also about the number of times that
the code appears in the population. Having more than one individual
with the same code provides some insurance that a particularly fit
code won't get lost due to random selection. But we won't worry about
such fine points here. Let's consider the tiny population
corresponding to the five tours that we've mentioned in our
conversation so far. Here's the distribution corresponding to this
population (technically, the distribution corresponds to the five
numbers in the last column which should sum to 1.0 but don't because
they've been rounded for readability to the nearest 0.001).
> (display-distribution (distribution initial-population))
(BOS PVD BWI SFO ORD DFW) - Length: 6969, Fitness: 0000, Probability: 0.000
(BOS PVD BWI ORD SFO DFW) - Length: 5852, Fitness: 1117, Probability: 0.311
(BOS BWI PVD DFW SFO ORD) - Length: 6379, Fitness: 0590, Probability: 0.164
(BOS BWI PVD ORD SFO DFW) - Length: 6408, Fitness: 0561, Probability: 0.156
(BOS PVD BWI DFW SFO ORD) - Length: 5648, Fitness: 1321, Probability: 0.368
Given the above distribution, how would we go about picking the next
generation? A typical GA might proceed by choosing a set of entities
in accord with the above distribution. This means that, on average,
31.1% of the entities would have (BOS PVD BWI ORD SFO
DFW) for a code, 16.4% would have (BOS BWI PVD DFW SFO
ORD) for a code, and so on. These entities would then be
paired, individually subjected to mutation, pair-wise subjected to
crossover, and perhaps modified by other genetic operators as well.
The resulting mutations and sexual combinations would then comprise
the next generation perhaps with some culling out (again with random
variation) to produce a generation of size equal to that of the
previous generation. That's basically it; the GAs produce generation
after generation of entities until they stumble on an entity
corresponding to whatever we were searching for, a short tour in the
case of TSP.
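One way the selection step might be implemented, as a sketch: spin and select-next-generation are invented names, (random n) is assumed to return an integer in [0, n) as in many Scheme implementations, and the probabilities argument is a distribution like the one computed by the distribution procedure above.

```scheme
;; Spin a roulette wheel: walk the population and its probability
;; distribution in parallel, subtracting probabilities from a random
;; number r in [0, 1) until it goes below the current ticket count.
(define (spin population probabilities r)
  (if (or (null? (cdr population)) (< r (car probabilities)))
      (car population)
      (spin (cdr population)
            (cdr probabilities)
            (- r (car probabilities)))))

;; Build a new generation the same size as the old one by repeatedly
;; spinning the wheel; mutation and crossover would then be applied
;; to the selected entities.
(define (select-next-generation population probabilities)
  (let loop ((n (length population)) (next '()))
    (if (= n 0)
        next
        (loop (- n 1)
              (cons (spin population probabilities (/ (random 1000) 1000))
                    next)))))
```

Note that an entity with probability 0.368 wins a little more than a third of the spins on average, but every entity with a nonzero probability has some chance of reproducing.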
There are endless variations and often enough lessons from biology can point the way to improved GAs. The above distribution assigned zero probability to the first tour that we looked at. It could be, however, that an otherwise uninteresting entity has a nice solution to some small part of the overall problem that isn't otherwise represented in the population. If that entity isn't given an opportunity to reproduce, then that piece of the overall solution could be lost. Crossover operators have the property that they can, at least occasionally, construct from two inferior entities an entity of high quality and high fitness.
A problem facing the designers of GAs as well as the designers of search algorithms concerns when to stop. We typically don't know the length of the shortest tour or, more generally, what the best hypothesis looks like, and so GA designers have to guess when to stop. For our little problem involving six cities we can check by performing an exhaustive search of all possible tours.
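For six cities such an exhaustive check is easy to write. Here is a sketch; permutations, remove-one and exhaustive-search are illustrative names, and tour-length is the same procedure the distribution code above depends on.

```scheme
;; Generate all permutations of a list; for n cities this yields n!
;; candidate tours, which is only feasible for tiny instances.
(define (permutations lst)
  (if (null? lst)
      '(())
      (apply append
             (map (lambda (x)
                    (map (lambda (p) (cons x p))
                         (permutations (remove-one x lst))))
                  lst))))

;; Return lst with the first occurrence of x removed.
(define (remove-one x lst)
  (cond ((null? lst) '())
        ((eq? x (car lst)) (cdr lst))
        (else (cons (car lst) (remove-one x (cdr lst))))))

;; Return the shortest tour, measured by the tour-length procedure.
(define (exhaustive-search cities)
  (let loop ((tours (permutations cities)) (best #f))
    (cond ((null? tours) best)
          ((or (not best)
               (< (tour-length (car tours)) (tour-length best)))
           (loop (cdr tours) (car tours)))
          (else (loop (cdr tours) best)))))
```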
> (display-optimal-tour (exhaustive-search cities))
(PVD ORD SFO DFW BWI BOS) - Length: 5530
We weren't far off from an optimal solution, but, in general, there is
no computationally tractable way for us to know this. In this case,
there are only 6! = 6 * 5 * 4 * 3 * 2 * 1 = 720 possible
tours. But the factorial function increases faster than any
exponential function (e.g., compare 10! = 3628800 with
2^10 = 1024 or note that 100! has
over 150 digits while 2^100 has only
31). You might think you could just keep track of statistics on
how the entities in populations change over time. If the most fit entity
in the last twenty populations has the same fitness within some
small tolerance, then perhaps you can't get any better.
Unfortunately, GAs like other search algorithms have a tendency to get stuck on "fitness plateaus" where the entities in populations are all just about the same in terms of fitness and in terms of the availability of useful variations corresponding to pieces of the sought-after solution. The entities in these populations don't have any incentive nor do they have any basis to evolve. In some cases, the GA designer can borrow from nature to create such an incentive and inject variation into a moribund population. You can create complex simulated environments and determine fitness by having entities fight for their survival in these environments. Some GAs make use of the digital analogs of parasites, symbiotes, competition, predators, and catastrophic events that can serve to spur evolution and encourage novel solutions.
Ana-Maria, Shoe, Brandon, Brian, Gideon and Sam studied the literature, invented new algorithms and reinvented old ones, struggled to take the general GA framework and apply it to particular problems, and they wrote code and performed lots of computational experiments. They were elated at times and frustrated at others. Sometimes the way seemed so clear and at others the mountains of books and relevant papers must have seemed pretty daunting. On the whole, I think the experience was positive both as a life experience and in terms of deepening their understanding of computer science. They discovered that genetic algorithms are not a panacea for solving hard computational problems. They learned that the metaphors and lessons from biology can be useful, but that there are other metaphors and mathematical approaches that are just as useful. And they discovered that nothing substitutes for really understanding the computational problem that you're trying to solve; there is little value in blindly applying GAs (or any other search method for that matter) to solving hard computational problems. I was certainly proud of what they achieved and how they went about it, working together as a team and striving to excel as individuals.
I can understand why they were attracted to the general area and to GAs in particular. Whether you buy into the metaphors and methodology of GAs or choose some other approach to searching very large hypothesis spaces, some method of simulating Darwin's dangerous idea is likely to prove useful in building intelligent machines. And keep in mind that simulated evolution is accelerating at an exponential rate keeping pace with advances in computing hardware. I'd be remiss if I didn't point out that, in addition to solving combinatorial problems with simulated evolution, scientists studying natural evolution use computers to simulate their theories thus providing interesting insights into biological systems. I especially like the accounts of such computer studies in Matt Ridley's "The Origins of Virtue" (on understanding the evolutionary basis for altruism and cooperation) and "The Red Queen" (on understanding the advantages of sex in reproduction).
[1] The crossover example conveniently used
two tours whose first halves and second halves contained the same
cities, BOS, PVD and BWI in the
first halves and DFW, ORD and
SFO in the second halves. Clearly it's possible to
choose a pair of tours that doesn't have this property. The crossover
operator for TSP has to be designed to avoid creating tours with
multiple occurrences of tokens corresponding to the same city. It's an
interesting exercise to design such an operator.