“Improving data quality is crucial, because if a human genome cannot be independently assembled then the sequence data cannot be sorted into the two sets of parental chromosomes, or haplotypes. This process haplotype phasing will become one of the most useful tools in genomic medicine. Establishing the complete set of genetic information that we received from each parent is crucial to understanding the links between heritability, gene function, regulatory sequences and our predisposition to disease.” J. C. Venter, “Multiple personal genomes await,” Nature, April 2010
For more than a decade, Sorin has been a research leader in the area of haplotype phasing and inference. In 2001, his group at Celera published the first paper on haplotype assembly. (The phasing algorithms developed by his Celera group were instrumental in the genome-wide haplotype phasing of the first diploid individual genome in 2007.) In 2002, he proposed a research program on the haplotype phasing problem akin to one in computer science that resolved the problem of characterizing the class of computable functions. The goal was to develop a unifying statistical-algorithmic theory of the haplotype phasing models proposed in the literature: parsimony phasing, maximum likelihood phasing, coalescent phasing, and Bayesian phasing.
Sorin’s latest NSF grant supports his work on “Clark Consistency Graphs and Haplotype Phasing Algorithms”—a project devoted to unifying algorithmic strategies for the long-range haplotype phasing problem, i.e., the computational inference of the mother and father haplotypes for each genotyped member in a sample from a population of individuals. The goal of the project is to develop practical haplotype phasing algorithms applicable to a very large sample of individuals. The algorithms will employ the graph theory structure of long common haplotype regions, identical by descent, that are shared by the individuals in the sample. This approach is based on the algorithmic model pioneered by Andy Clark (1990) and elaborated through graph theory via Clark Consistency Graphs introduced by Sharan, Halldorsson and Istrail (2006) and the long-range phasing graphs employed by deCODE Genetics (Kong et al., 2008).
One major application for haplotype phasing is in genome-wide association studies (GWAS), whose current samples have data sets with billions of items, i.e., containing thousands of individuals genotyped at a half-million to one million single-nucleotide polymorphisms each. Unlike the Icelandic population, for example, which was successfully phased due to its well-recorded pedigree structure, the U.S. population, with its very complex and less-known pedigree structure, requires practical phasing algorithms for a sample of about one million individuals. The first beneficiary of this project’s outcome is expected to be the International Multiple Sclerosis Genetic Consortium, whose GWAS Analysis Group includes Sorin’s lab.