Darwin's Dangerous Algorithm

Supplementary References

If you have the chance, I recommend taking an introductory course in molecular biology; it's a fascinating field and it's science and technology that will make a big difference in your lives. If can't take a class, then David Clark and Lonnie Russell's Molecular Biology Made Simple and Fun [Clark and Russell2000] is a wonderful introduction to the field that explores fundamental biology, the tools used by molecular biologist and a sampling of applications and relevant technologies. It's technically accurate, fun to read, and it has lots of useful diagrams and illustrations and plenty of detail. When it comes to simulated natural selection, Melanie Mitchell's An Introduction to Genetic Algorithms [Mitchell1996] is a fine book for getting some additional background on genetic algorithms.

Sample Lecture Material

Introduction Genetics and Natural Selection

A gene is a unit of genetic information. Taken together, an organism's genes determine all of its characteristics. Simple organisms like bacteria may have 2,000 to 4,000 genes while more complicated organisms likes plants and animals may have 50,000 to 100,000 genes.

The chemical reactions that govern the sustenance, growth and repair of biological organisms are determined by proteins. Proteins (and, in particular, proteins called enzymes that mediate chemical reactions) are coded in genes and larger ensembles of genes called chromosomes and then produced (or expressed) when needed.

Genes consist of DNA (deoxyribonucleic acid) molecules and can be thought of as sequences of symbols in an appropriate alphabet. In the case of DNA, the alphabet is determined by four molecular bases: adenine, thymine, guanine and cytosine most often identified by their first characters, A, T, G, and C.

We can then think of biological organisms as encoded in sequences of the letters A, T, G, and C, much as we think of computer programs as encoded in sequences of 1s and 0s. Each cell in an organism includes the complete DNA for the organism.

In order to create new organisms or in the processes of growth and repair, new cells are created and the genetic codes (DNA) are passed on to each new cell. In sexual reproduction, two cells combine their DNA to produce a new cell whose DNA is some combination of the DNA of the two parent cells.

This process of genetic recombination is an important mechanism in the production of new organisms that exhibit new combinations of characteristics. Another mechanism responsible for new organisms is mutation in which the genetic code is altered, e.g., by the effects of radiation, foreign substances in the cell, and mistakes in copying.

Natural selection is the ``process by which individual characteristics that are more favorable to reproductive success are `chosen,' because they are passed on from one generation to the next, over characteristics that are less favorable'' [Menand2001].

It's important to note that natural selection is blind to any particular design or plan; the choice or selection of characteristics is determined entirely by reproductive success.

Simulating Natural Selection

The processes of genetic recombination, mutation, and natural selection serve to produce organisms with improved chances of reproductive success. We can simulate these processes in a computer program by defining the digital analogs for genetic codes and developing algorithms that simulate the basic biological processes.

The digital codes can be designed to specify the characteristics of any computational process we care to investigate, e.g., they can specify the characteristics of organisms that correspond to computer programs, class schedules, vehicle routes, etc. We can also define what constitutes reproductive success, e.g., programs that run fast or exhibit a particular input/output behavior, class schedules that minimize conflicts, and vehicle routes that minimize distance.

Genetic algorithms are an approach to solving computational problems by simulating reproduction and natural selection. For any particular problem that we wish to solve using a genetic algorithm, there are a number of design questions that have to be addressed:

What are the characteristics to be modeled and how are they to be encoded in some form of digital DNA?
What are the factors that influence reproductive success and how are pairs of individuals to be selected for reproduction?
Given two individuals selected for reproduction how is their DNA to be modified and combined to produce their offspring?

Let's consider how we might address these questions for the problem of class scheduling. (In the next lecture, we'll look at the problem of vehicle routing (the traveling salesman problem) which is discussed at some length in Chapter 15).

The class scheduling problem is the problem faced by an individual student in choosing classes and class times in order to satisfy the requirements of a particular concentration. Each concentration specifies a set of requirements. Some of these requirements have only one way of being fulfilled (e.g., you must take CS31 and CS22) and others allow options (e.g., you must take either the sequence CS15 followed by CS16 or the sequence CS17 followed by CS18). Some courses have prerequisites (e.g., CS22 is a prerequisite for CS51) and courses are taught at specified times, in specified semesters and, in some cases, only in specified years. In addition to the requirements imposed by the concentration, there might be additional requirements imposed by the student (e.g., I never want to take a class that meets before 10am or I want to spend the second semester of my junior year in Paris studying French architecture).

To answer Question 1 above, devise a data structure to encode a particular schedule over three years (six semesters starting in sophomore year).

A good schedule is one that meets all the requirements of the concentration and all of the additional requirements of the student (call these the hard requirements). You could rate schedules on the basis of how many requirements they satisfy. You could add soft requirements (e.g., I'd like to be finished with all my classes by 4pm on Tuesdays and Thursdays). You could then rate the schedules that satisfy all the hard requirements by how many soft requirements they satisfy. Then choose schedules for reproduction based on this two-tier rating strategy. Such a rating strategy is called a fitness function. This provides an answer to Question 2.

Finally, given two schedules, how might you combine them to create offspring. You could make point mutations by randomly selecting a semester and replacing a course with an alternative one. You could combine parts of one schedule with the corresponding parts from another schedule by, e.g., swapping the classes in selected semesters. This method of combination is often called crossover.

The overall genetic algorithm operates as follows: Start with an initial population of randomly chosen schedules. Rate the individuals according to your fitness function and then select the set of parents for the next generation by randomly selecting pairs of individuals so that the more highly rated an individual the more likely it is to be chosen for reproduction. Subject the individuals to point mutations and then combine their genetic descriptions to produce new individuals using crossover. These new individuals will comprise the population of the next generation. Continue for some fixed number of generations or until an individual is produced that has all the desired characteristics.

A Scheme implementation of a genetic algorithm for solving a version of the traveling salesman problem is available as a file in zip format. This implementation is based on the code in Chapter 15 and contains a solution to the problem mentioned in a footnote regarding the crossover operator.

Code Fragments

Download all the code fragments in this chapter as a file in zip format.