Selecting Tagging SNPs Using Simulated Annealing

This is a computational genomics project on selecting tagging SNPs (single-nucleotide polymorphisms) using simulated annealing.

As you know, DNA of each living being is made of four "letters" - nucleotides: A,T,G and C. This
sequence of nucleotides is unique for each person. Most of the differences between our genomes are represented by SNPs - changes in a one singe "letter" or genetic code. These changes can give person bueatiful green eyes and blonde hair. Or they can cause various sicknesses, such as Alzheimer, cancer or diabetes. That's why "personal medicine" has become so popular. Someone draws a drop of your blood, analyzes the DNA in it and then tells you that you have a high risk of a breast cancer. Everybody is happy.

However, analyzing each possible SNP in your DNA is costly, and, in fact, unnecessary because there is
a strong statistical correlation between changes in the human genome. Namely, if you have a "T" in one place, there is a strong chance you have a "A" in another.So if you want to know how different you are, there is no need to sequence all the possible SNPs in your genome (there are ~300,000 of them). One just need to sequence only some of these SNPs ("tags") and then predict rest using values of these "tags".

There are lots of methods to do this, including hill-climbing methods and dynamic programming. Here is a program that allows selecting a set of tagging SNPs using simulated annealing. I've written it for the "comp biology" class I've taken during my first year at Brown. The archive also includes few datasets you can play with; these are datasets from the HapMap project.

Also, if you are interested in a subject, there is a brief tutorial on tagging SNPs created after paper of B. Halldorsson and S. Istrail "Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies". I just tried to rewrite the paper in a more accessible language, so give all the credits to the authors of the paper.