Darwin's Dangerous Algorithm
Albert, Aron, Eliot, Erika, Jennifer, Leah, Luke and Susannah were important in inspiring me to work on this journal. Indeed, my conversations with students over the years have been the highlight of my academic career. I'm indebted to hundreds of students who have challenged me, made me think differently about things I thought I understood, and, in diverse and often dramatic fashion, demonstrated to me the value of different approaches to learning and teaching. Back in July, I exchanged email with a former student, now a graduate student at the University of Washington. That exchange got me thinking back to a learning experience that she and five of her fellow students became involved in when they were sophomores at Brown, an experience that taught me a lot about working with students as they direct their own educations.
In the Fall of 1996, six students, Ana-Maria, Andrew (or Shoe as he liked to be called), Brandon, Brian, Gideon and Sam, came to me asking if I'd serve as a faculty sponsor for a GISP (Group Independent Study Project) in the following spring. The proposed subject matter of the GISP was all over the map; they wanted to pursue topics and connections among topics including evolution, natural selection, computational complexity, and combinatorial optimization. In particular, they wanted to explore the field of genetic algorithms and genetic programming. I was skeptical of both the material and the value of spending an entire semester studying it. From my perspective, genetic algorithms and the related genetic programming are simply methods for searching in very large hypothesis spaces and not terribly effective methods at that. I expect these terms, hypothesis space, hypotheses, search, are unfamiliar; so let's back up a bit and proceed at a more leisurely pace.
Depending on the context, the term hypothesis space might refer to a set of possible answers to a posed question, a set of possible explanations for a given phenomenon, or a set of possible solutions to a specified problem. The answers, explanations and solutions are referred to generically as hypotheses. In many computational problems, you proceed by looking at or searching through a sequence of such hypotheses. If you're lucky, the hypothesis space has some inherent structure or regularity, so that having examined one hypothesis and found it wanting, you can extract clues about how to transform the current hypothesis into one or more alternative hypotheses that might, just might, correspond to better answers, explanations or solutions. For example, in the traveling salesman problem (see the September 2nd entry for a description of the traveling salesman problem and other hard computational problems), a hypothesis is a tour that visits all of the cities in which the salesman has customers. It may be that by looking at a given tour you can identify features, e.g., portions of the tour that have the salesman ineffectively bouncing back and forth across the country, that suggest how to create alternative, shorter tours. Since the traveling salesman problem is in the class of NP-complete problems (the decision-problem version of the problem anyway), these features and accompanying hints don't always lead to shorter tours, but methods based on them often perform well on problems occurring in practice.
Genetic algorithms constitute a general method (framework, really, as I'll explain shortly) for searching in very large hypothesis spaces. The "very large" qualifier is added because if the space is small enough, then we can "search" it exhaustively by simply enumerating every hypothesis and picking out the best one. Such general search methods can be used for solving so-called combinatorial optimization problems (problems in which the component pieces of solutions, e.g., the individual city-to-city hops that comprise a tour, can be assembled or combined in a large number of ways), learning problems like the spam classification problem that we looked at on August 30th, and automatic programming problems like synthesizing a robot program to guide a robot through a maze (wouldn't it be nice if you could get a computer to write all your programs for you). Indeed, genetic algorithms are sometimes scathingly referred to as the second-best method for solving just about any problem.
So I was skeptical but I got caught up in their enthusiasm. Genetic algorithms (GAs) are interesting not only for what they can do, e.g., solve hard combinatorial optimization problems, but also for how they do it. GAs simulate in a rough way the process of evolution and natural selection. GAs operate on whole collections or populations of hypotheses encoded as strings of digital DNA (data structures that programmers often think of in terms of genes, codons, and other terms familiar to molecular biologists). Hypotheses are subjected to simulated mutation and the "fittest" hypotheses are allowed to "reproduce" resulting in offspring that combine features of the best hypotheses. Once you get into the spirit, you can borrow just about anything you like from evolutionary biology. You could even incorporate ideas from currently unpopular theories such as those of Jean-Baptiste Lamarck (1744-1829) who proposed that evolution is directly influenced by the experiences of individual organisms over their lifetimes. You're writing the simulation so you have a great deal of freedom. (It's a bit like playing god which might explain some of the attraction of the field.) More often than not, however, programmers tend to follow nature's suggestions. It seems plausible that nature stumbled upon some pretty efficient ways of searching in very large hypothesis spaces, otherwise how do you account for us; we certainly think we're pretty cool.
They had me reading everything: Richard Dawkins's "The Selfish Gene" [Oxford University Press, 1989], James Watson's "The Double Helix: A Personal Account of the Discovery of the Structure of DNA" [W. W. Norton and Company, 1980], Robert Axelrod's "The Evolution of Cooperation" [Basic Books, 1984], John Maynard Smith's "Evolution and the Theory of Games" [Cambridge University Press, 1982] and "Evolutionary Genetics" [Oxford University Press, 1989], just to name a few of the books that ended up on my night-time reading table. Genetic, molecular and evolutionary biology provide a wealth of theory to serve as the basis for heuristics to assist in guiding search. We were looking at anything that might make our algorithms better or perhaps supply some missing piece thereby enabling us to tap the power and mystery of natural selection. Why settle for some ho-hum low-fidelity version of simulated evolution when you can have a super-charged high-fidelity version with all the bells and whistles?
In addition to all the readings in biology, we also read Melanie Mitchell's "An Introduction to Genetic Algorithms" [MIT Press, 1996], which is a very readable introduction to the field. I had met Melanie two years earlier during a sabbatical stay at the Santa Fe Institute and she managed to dispel some of my skepticism. During my stay in Santa Fe, I gained grudging respect for what a good programmer could do with GAs. It's important to point out, however, that GAs aren't a ready-made solution for any particular problem; they offer a general framework for solving problems but there are lots of pieces that a programmer has to supply. Let's consider how we'd go about trying to solve an instance of the traveling salesman problem using GAs.
The first thing you need to think about is how you're going to
represent a hypothesis, a tour in the case of the traveling salesman
problem (TSP). I don't have time for an introduction to genetics so
we'll dispense with the biological terminology. For our purposes, a
string of digital DNA corresponds to a list of simple tokens. For the
tokens in our TSP instance, we'll use the standard three-letter codes
for the airports in six cities: BOS for Boston,
BWI for Baltimore, DFW for Dallas Fort
Worth, ORD for Chicago, PVD for Providence
and SFO for San Francisco. A tour then is a list of six
tokens, with the assumption that the last leg of the tour involves a flight
from the city corresponding to the last token in the list back to the
city corresponding to the first token in the list. Here's an example
tour, (BOS PVD BWI SFO ORD DFW), Boston to Providence to
Baltimore to San Francisco to Chicago to Dallas Fort Worth and,
finally, back to Boston. We'll also need a matrix representing the
distance between each pair of cities.
      BOS   BWI   DFW   ORD   PVD   SFO
BOS  0000  0371  1550  0867  0043  2704
BWI  0228  0000  1100  0621  0328  2400
DFW  1550  1100  0000  0802  1503  1464
ORD  0867  0621  0802  0000  0849  1846
PVD  0043  0328  1503  0849  0000  2654
SFO  2704  2400  1464  1846  2654  0000
Using the above distance matrix, we can calculate that our example
tour has a length of 6969. The first part of the tour makes a certain
amount of sense; we travel down the eastern seaboard from Boston
(BOS) to Providence (PVD) to Baltimore
(BWI). But then we shoot over to San Francisco
(SFO), back across the country to Chicago
(ORD), and down to Dallas Fort Worth (DFW)
before flying back to Boston. If we were to reverse the order of
tokens corresponding to Chicago and San Francisco in the original
tour, we'd get the following tour, (BOS PVD BWI ORD SFO
DFW), of length 5852, a significant improvement. You might
imagine two tokens swapping positions as being a mutation of
sorts. And you might implement this form of mutation using a
subroutine for generating random numbers. Using such a subroutine, it
is very easy to write a program that examines a list representing a
tour and, with a given fixed probability, for each consecutive pair of
tokens in the sequence, either reverses the order of the two tokens or
leaves them as they were. Such a program is called a genetic operator in the
parlance of genetic algorithms. Is this form of mutation biologically
plausible? Perhaps but it doesn't really matter; we're calling the
shots and we're free to slavishly emulate natural selection to the
extent that we understand it or depart from reality as we see fit.
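To make this concrete, here is one way the representation might be coded in Scheme. The association-list layout of the distance matrix and the names distances and distance are my own assumptions for illustration; tour-length is the procedure that the genetic-algorithm code later in this entry relies on.

```scheme
;; A tour is a list of city symbols, e.g., (BOS PVD BWI SFO ORD DFW).
;; The distance matrix is stored as a nested association list; the
;; diagonal entries are omitted since a tour never travels from a
;; city to itself. This layout is an assumption for illustration.
(define distances
  '((BOS (BWI . 371) (DFW . 1550) (ORD . 867) (PVD . 43) (SFO . 2704))
    (BWI (BOS . 228) (DFW . 1100) (ORD . 621) (PVD . 328) (SFO . 2400))
    (DFW (BOS . 1550) (BWI . 1100) (ORD . 802) (PVD . 1503) (SFO . 1464))
    (ORD (BOS . 867) (BWI . 621) (DFW . 802) (PVD . 849) (SFO . 1846))
    (PVD (BOS . 43) (BWI . 328) (DFW . 1503) (ORD . 849) (SFO . 2654))
    (SFO (BOS . 2704) (BWI . 2400) (DFW . 1464) (ORD . 1846) (PVD . 2654))))

;; Look up the distance for a single leg of a tour.
(define (distance from to)
  (cdr (assq to (cdr (assq from distances)))))

;; Sum the legs of a tour, including the final leg back to the start.
(define (tour-length tour)
  (let loop ((cities tour) (total 0))
    (if (null? (cdr cities))
        (+ total (distance (car cities) (car tour)))
        (loop (cdr cities)
              (+ total (distance (car cities) (cadr cities)))))))
```

With these definitions, (tour-length '(BOS PVD BWI SFO ORD DFW)) evaluates to 6969, the length computed above.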
Other forms of mutation might not work out so well for our purposes. If our mutation routines were free to substitute any one token for any other token with some probability, that might result in a tour visiting the same city twice or, worse, inserting a city that we didn't have to visit in the first place. Luckily we can tame natural selection to avoid generating genetic misfits, or, in our case, tours that lead us in circles or to destinations we'd rather not visit.
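A minimal sketch of the well-behaved pairwise-swap mutation described above; swap-mutate is a name I've invented, and I assume an implementation-supplied (random n) that returns an integer in [0, n), as in Guile and several other Schemes.

```scheme
;; Walk down the tour; with probability rate, swap each consecutive
;; pair of tokens. Both the name swap-mutate and the use of (random n)
;; are assumptions for this sketch.
(define (swap-mutate tour rate)
  (let loop ((cities tour) (result '()))
    (cond ((null? cities) (reverse result))
          ((null? (cdr cities)) (reverse (cons (car cities) result)))
          ((< (random 1000) (* rate 1000))
           ;; swap this pair and continue past both tokens
           (loop (cddr cities)
                 (cons (car cities) (cons (cadr cities) result))))
          (else (loop (cdr cities) (cons (car cities) result))))))
```

Calling (swap-mutate tour 0.05) leaves most pairs alone, while a rate of 1 swaps every consecutive pair.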
We can also arrange for programs that simulate sex and thereby
enable hypotheses represented as strings of digital DNA to share their
genetic heritage. Our current winner, (BOS PVD BWI ORD SFO
DFW) (length 5852) has a good east coast solution and a so-so
finish. But consider, (BOS BWI PVD DFW SFO ORD) (length
6379); it wanders up and down the east coast but takes a nice swing
down through the south (DFW) over to the west coast
(SFO) and then back across the country stopping in
Chicago (ORD) before returning to Boston
(BOS). Suppose these two were to get together and swap
their best attributes. We break (BOS PVD BWI ORD SFO
DFW) into two equal parts, (BOS PVD BWI) and
(ORD SFO DFW), and the same for (BOS BWI PVD DFW
SFO ORD) resulting in (BOS BWI PVD) and (DFW
SFO ORD). Now we swap parts forming two new sequences:
(BOS BWI PVD ORD SFO DFW) of length 6408 which ain't so
great and a second, golden child, (BOS PVD BWI DFW SFO
ORD) of length 5648, the best we've seen so far. Does this
have any analog in natural selection? Yes, and in the terminology of
genetic algorithms it's called a crossover genetic operator [1].
Unlike the simple mutation operator we discussed earlier, crossover
operators require two strings of digital DNA to tango. So, how do
we simulate sex and reproduction in a genetic algorithm?
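The half-and-half recombination just illustrated can be sketched as follows; crossover is my own name for it, and, as the footnote at the end of this entry points out, splitting at the midpoint only yields legal tours when the two parents happen to have the same cities in each half.

```scheme
;; Split each parent at the midpoint and exchange second halves.
;; This only produces legal tours when both halves of the two parents
;; contain the same sets of cities; a general-purpose crossover
;; operator must repair or avoid duplicated cities.
(define (crossover mom dad)
  (let* ((half (quotient (length mom) 2))
         (split (lambda (tour)
                  (let loop ((front '()) (back tour) (n half))
                    (if (= n 0)
                        (cons (reverse front) back)
                        (loop (cons (car back) front) (cdr back) (- n 1))))))
         (m (split mom))
         (d (split dad)))
    (list (append (car m) (cdr d))
          (append (car d) (cdr m)))))
```

Applied to our two parents, it returns exactly the two children from the example: the golden child (BOS PVD BWI DFW SFO ORD) and the not-so-great (BOS BWI PVD ORD SFO DFW).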
First of all, note that we're quite clear about what it means for tours, corresponding to strings of digital DNA, to be improving; shorter tours are more desirable and we're searching for the shortest possible tour. In natural selection, individuals are not necessarily getting "better" even on average; natural selection simply selects for genes (which are basically programs for building individuals or pieces of individuals) that manage to do a good job at reproducing themselves. If such genes give rise to individuals that are stronger or smarter, then that's interesting but it needn't be the case. It's certainly possible that natural selection might stumble on a program for building small, fast, mindless individuals that are superior to us in the sense that they reproduce so effectively that they wipe us off the planet. But, as the designers of genetic algorithms, we are free to play fast and loose with what we know about natural selection. In particular, we get to define the criterion that governs who gets to reproduce and how much.
Genetic algorithms operate by analyzing a set of hypotheses referred to as a population, a set of tours in the case of TSP. At any point in its operation, a GA will have a current population that it analyzes in order to determine which entities in the population will get an opportunity to reproduce. Using the results of its analysis, the GA then constructs the next generation by mutating and pairing selected entities from the current generation. The generation so produced becomes the current generation and the GA repeats this process. If an entity is found in a generation that meets some specified termination criterion, the GA will halt and return the entity, otherwise the GA continues until it has examined a specified number of generations. Exactly how entities are selected for reproduction is at the heart of designing a GA.
It shouldn't be hard to convince yourself that not all mutations and crossovers will result in improvements; indeed, in our crossover example only one of the two new tours was an improvement on its parents. We probably don't want to give all entities an equal opportunity to reproduce nor can we expect that all of the entities resulting from reproduction will be equally desirable for future reproduction. We use the term fitness to refer to our measure of how desirable or fit a string of digital DNA is for reproduction. The higher the fitness of a string the more likely it will get selected for reproduction. Since shorter tours are better than longer ones we want to choose a fitness function that's inversely related to tour length. Since we're interested in conferring a reproductive advantage within a given population we can focus on the entities within that population but, as I'll explain shortly, this can result in shortsightedness. Let's define the fitness of a tour to be the difference between the length of the tour and the length of the longest (worst) tour in the population. With this definition the longest tour is the least fit, with a fitness of zero, and the shortest tour is the most fit. How can we use this measure of fitness to select tours for reproduction?
Randomness seems to play a role in natural selection. Occasionally
fit entities fail to reproduce and not-so-fit entities produce more offspring
than one would expect. Randomness helps nature to avoid heading down
a reproductive dead end. In GAs, the process of selection is carried
out like a lottery in which fit entities get more lottery tickets than
less fit entities but even a not-so-fit entity can win the jackpot and
reproduce a whole gaggle of offspring. Typically the probability that
an entity will be allowed to reproduce is proportional to its fitness.
In Scheme, we might calculate the fitness and reproductive probability
of the entities in a population as follows (see the July 15th entry for a
discussion of functional programming and the map operator
in particular).
(define (distribution population)
  (let* ((worst (apply max (map tour-length population)))
         (fitness (lambda (tour) (- worst (tour-length tour))))
         (total (apply + (map fitness population)))
         (probability (lambda (tour) (/ (fitness tour) total))))
    (map probability population)))
This is a simplified version of what's done in practice. In practice,
a distinction is typically made between individuals (referred to as
phenotypes) and classes of individuals all of whom share the
same genetic code (or genotype). For example, in a population
of TSP tours, it could be the case that the same tour appears multiple
times. In designing a GA, you have to think about not only the
fitness of a particular code but also about the number of times that
the code appears in the population. Having more than one individual
with the same code provides some insurance that a particularly fit
code won't get lost due to random selection. But we won't worry about
such fine points here. Let's consider the tiny population
corresponding to the five tours that we've mentioned in our
conversation so far. Here's the distribution corresponding to this
population (technically, the distribution corresponds to the five
numbers in the last column which should sum to 1.0 but don't because
they've been rounded for readability to the nearest 0.001).
> (display-distribution (distribution initial-population))
(BOS PVD BWI SFO ORD DFW) - Length: 6969, Fitness: 0000, Probability: 0.000
(BOS PVD BWI ORD SFO DFW) - Length: 5852, Fitness: 1117, Probability: 0.311
(BOS BWI PVD DFW SFO ORD) - Length: 6379, Fitness: 0590, Probability: 0.164
(BOS BWI PVD ORD SFO DFW) - Length: 6408, Fitness: 0561, Probability: 0.156
(BOS PVD BWI DFW SFO ORD) - Length: 5648, Fitness: 1321, Probability: 0.368
Given the above distribution, how would we go about picking the next
generation? A typical GA might proceed by choosing a set of entities
in accord with the above distribution. This means that, on average,
31.1% of the entities would have (BOS PVD BWI ORD SFO
DFW) for a code, 16.4% would have (BOS BWI PVD DFW SFO
ORD) for a code, and so on. These entities would then be
paired, individually subjected to mutation, pair-wise subjected to
crossover, and perhaps modified by other genetic operators as well.
The resulting mutations and sexual combinations would then comprise
the next generation perhaps with some culling out (again with random
variation) to produce a generation of size equal to that of the
previous generation. That's basically it; the GAs produce generation
after generation of entities until they stumble on an entity
corresponding to whatever we were searching for, a short tour in the
case of TSP.
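One way the selection step might be implemented, as a sketch: spin and select-next-generation are invented names, (random n) is assumed to return an integer in [0, n) as in many Scheme implementations, and the probabilities argument is a distribution like the one computed by the distribution procedure above.

```scheme
;; Spin a roulette wheel: walk the population and its probability
;; distribution in parallel, subtracting probabilities from a random
;; number r in [0, 1) until it goes below the current ticket count.
(define (spin population probabilities r)
  (if (or (null? (cdr population)) (< r (car probabilities)))
      (car population)
      (spin (cdr population)
            (cdr probabilities)
            (- r (car probabilities)))))

;; Build a new generation the same size as the old one by repeatedly
;; spinning the wheel; mutation and crossover would then be applied
;; to the selected entities.
(define (select-next-generation population probabilities)
  (let loop ((n (length population)) (next '()))
    (if (= n 0)
        next
        (loop (- n 1)
              (cons (spin population probabilities (/ (random 1000) 1000))
                    next)))))
```

Note that an entity with probability 0.368 wins a little more than a third of the spins on average, but every entity with a nonzero probability has some chance of reproducing.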
There are endless variations and often enough lessons from biology can point the way to improved GAs. The above distribution assigned zero probability to the first tour that we looked at. It could be, however, that an otherwise uninteresting entity has a nice solution to some small part of the overall problem that isn't otherwise represented in the population. If that entity isn't given an opportunity to reproduce, then that piece of the overall solution could be lost. Crossover operators have the property that they can, at least occasionally, construct from two inferior entities an entity of high quality and high fitness.
A problem facing the designers of GAs as well as the designers of search algorithms concerns when to stop. We typically don't know the length of the shortest tour or, more generally, what the best hypothesis looks like, and so GA designers have to guess when to stop. For our little problem involving six cities we can check by performing an exhaustive search of all possible tours.
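For six cities such an exhaustive check is easy to write. Here is a sketch; permutations, remove-one and exhaustive-search are illustrative names, and tour-length is the same procedure the distribution code above depends on.

```scheme
;; Generate all permutations of a list; for n cities this yields n!
;; candidate tours, which is only feasible for tiny instances.
(define (permutations lst)
  (if (null? lst)
      '(())
      (apply append
             (map (lambda (x)
                    (map (lambda (p) (cons x p))
                         (permutations (remove-one x lst))))
                  lst))))

;; Return lst with the first occurrence of x removed.
(define (remove-one x lst)
  (cond ((null? lst) '())
        ((eq? x (car lst)) (cdr lst))
        (else (cons (car lst) (remove-one x (cdr lst))))))

;; Return the shortest tour, measured by the tour-length procedure.
(define (exhaustive-search cities)
  (let loop ((tours (permutations cities)) (best #f))
    (cond ((null? tours) best)
          ((or (not best)
               (< (tour-length (car tours)) (tour-length best)))
           (loop (cdr tours) (car tours)))
          (else (loop (cdr tours) best)))))
```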
> (display-optimal-tour (exhaustive-search cities))
(PVD ORD SFO DFW BWI BOS) - Length: 5530
We weren't far off from an optimal solution, but, in general, there is
no computationally tractable way for us to know this. In this case,
there are only 6! = 6 * 5 * 4 * 3 * 2 * 1 = 720 possible
tours. But the factorial function increases faster than any
exponential function (e.g., compare 10! = 3628800 with
2^10 = 1024 or note that 100! has
over 150 digits while 2^100 has only
31). You might think you could just keep track of statistics on
how the entities in populations change over time. If the most fit entity
in the last twenty populations has the same fitness within some
small tolerance, then perhaps you can't get any better.
Unfortunately, GAs like other search algorithms have a tendency to get stuck on "fitness plateaus" where the entities in populations are all just about the same in terms of fitness and in terms of the availability of useful variations corresponding to pieces of the sought-after solution. The entities in these populations don't have any incentive nor do they have any basis to evolve. In some cases, the GA designer can borrow from nature to create such an incentive and inject variation into a moribund population. You can create complex simulated environments and determine fitness by having entities fight for their survival in these environments. Some GAs make use of the digital analogs of parasites, symbiotes, competition, predators, and catastrophic events that can serve to spur evolution and encourage novel solutions.
Ana-Maria, Shoe, Brandon, Brian, Gideon and Sam studied the literature, invented new algorithms and reinvented old ones, struggled to take the general GA framework and apply it to particular problems, and they wrote code and performed lots of computational experiments. They were elated at times and frustrated at others. Sometimes the way seemed so clear and at others the mountains of books and relevant papers must have seemed pretty daunting. On the whole, I think the experience was positive both as a life experience and in terms of deepening their understanding of computer science. They discovered that genetic algorithms are not a panacea for solving hard computational problems. They learned that the metaphors and lessons from biology can be useful, but that there are other metaphors and mathematical approaches that are just as useful. And they discovered that nothing substitutes for really understanding the computational problem that you're trying to solve; there is little value in blindly applying GAs (or any other search method for that matter) to solving hard computational problems. I was certainly proud of what they achieved and how they went about it, working together as a team and striving to excel as individuals.
I can understand why they were attracted to the general area and to GAs in particular. Whether you buy into the metaphors and methodology of GAs or choose some other approach to searching very large hypothesis spaces, some method of simulating Darwin's dangerous idea is likely to prove useful in building intelligent machines. And keep in mind that simulated evolution is accelerating at an exponential rate keeping pace with advances in computing hardware. I'd be remiss if I didn't point out that, in addition to solving combinatorial problems with simulated evolution, scientists studying natural evolution use computers to simulate their theories thus providing interesting insights into biological systems. I especially like the accounts of such computer studies in Matt Ridley's "The Origins of Virtue" (on understanding the evolutionary basis for altruism and cooperation) and "The Red Queen" (on understanding the advantages of sex in reproduction).
[1] The crossover example conveniently used
two tours whose first halves and second halves contained the same
cities, BOS, PVD and BWI in the
first halves and DFW, ORD and
SFO in the second halves. Clearly it's possible to
choose a pair of tours that doesn't have this property. The crossover
operator for TSP has to be designed to avoid creating tours with
multiple occurrences of tokens corresponding to the same city. It's an
interesting exercise to design such an operator.