# 2019 Course Chapters

#### 1. Linkage Disequilibrium (LD) and the Unification of LD Measures Problem

#### 2. The Coalescent: General Theory, Polya Urn Games and Ewens Sampling Lemma

#### 3. Genome-Wide Association Studies (GWAS): The Computational Pipeline, Genomic Common and Rare Variants, Statistical Tests of Association, Genetic Heterogeneity and the Missing Heritability Puzzle, Statistical Models of "Disease"

#### 4. Algorithmic Problems in Haplotype Reconstruction and Tagging SNPs Selection

#### 5. Markov Chain Monte Carlo theory and practice and the Population Genomics Substructure Problem

#### 6. Genomic Privacy Algorithms

#### 7. Causality and Prediction

#### News

Students who took this class in the past belonged to departments of computer science, mathematics, physics, chemistry, statistical sciences, engineering, medical school; a number faculty members from Brown Medical School audited the class as well. Biological background is not required for graduate and advanced undergraduates in computer science and mathematical students; the class will be self-contained.

Life science and medical students are all welcome.

This is a full-lecture graduate course dedicated to a type of Big Data sets in human genomics called “genome-wide association studies (GWAS)”. The genome as a model of Big Data is characterized by *“-omes sweet –omes,”* i.e., a large set of genomic systems of complexity -- each one a world of its own, representing a biological functional component of the cell -- all aligned to the human genome assembly axis. Genome, transcriptome, proteome, SNP-ome, haplotype-ome are examples of –omes employed in this course. The class will focus on computational methods used in GWAS, where we do not treat genomic tools as black boxes – we open the box and go into the computational methods in full statistical and algorithmic detail with the tool-builder attitude and aim to be able to improve the speed and accuracy of the methods. We also believe that critical analysis of methods is essential for scientific work. The class assignments and discussions will encourage debate and expression of critical opinion as an important component of our class work.

GWASs have been aimed at detecting variants at genomic loci that are associated with disease in the population and, in particular, at detecting associations between single-nucleotide polymorphism (SNPs) and common disease, such as diabetes, heart disease, cancer and psychiatric disease.

A typical “perfect” input (i.e., error-free and with complete information) for a GWAS computational problem for the disease D is a 2n x m matrix of 0s and 1s. Pair of rows corresponds to the two sequences of SNPs of one individual’s genome (humans are diploids so one row represents the SNPs sequence inherited from the mother, and the other row the sequence of SNPs inherited from the father). Each column of the matrix corresponds to the position or genomic locus of a SNP on the human genome. “Variants at genomic loci” are assumed to be binary and represented by 0 and 1. Obtaining the values of the entries in the matrix are the result of “genotyping” the individuals at the corresponding SNPs (usually with a SNP chip or SNP array). The rows of the input matrix are divided in two parts: “Cases” are the top one-third of the rows that correspond to the individuals affected by the disease D; “Controls” are the bottom two-thirds of the rows that correspond to the two parents (assumed unaffected by the disease) of the affected individuals. Usually, the number of individuals n is several tens of thousands and the number of SNPs is several million, resulting in input matrices in the order of about a few billion entries. The computational goal, intuitively, is to discover special patterns of SNPs (i.e., of 0s and 1s) that are more prevalent in the Cases as compared with the Controls; the positions of these special patterns on SNPs on the genome implicate, by “guilt by association,” the genes close to which, or within which, these SNPs are located.

GWASs are based on the principle of Linkage Disequilibrium (LD) at the “population” level, between genotyped SNPs and un-genotyped “causal variants.” LD is the nonrandom association between the variants at different loci on the genome. The set of n individuals is a “sample” extracted (somewhat, but not really random) from a certain population of individuals (say, people living in New England); many aspects of the GWAS aim at statistical analyses associated with such a study design, e.g., genealogy inference models such as the Coalescent, the statistical power of the sample to reveal associations with the disease phenotype, and the many types of confounding effects and corrections needed due to the “subpopulation structure” of the sample and the multiple hypotheses testing intrinsic to such high-throughput genomic data set.

GWASs cost about 10 million dollars per study, and their current state of scientific achievement has received quite critical reviews. Essentially the main criticism is: *“To date, genome-wide association studies (GWAS) have published hundreds of common variants whose allele frequencies are statistically correlated with with various illnesses and traits. However, the vast majority of such variants have no established biological relevance to disease or clinical utility for prognosis or treatment.”* (J. McClell and M.C. King, 2010). Despite that, there is tremendous novel science that emerged due to the GWAS studies, in the 7+ years of their existence. The nature of medical evidence and reasoning in the genomics context of disease has become much more rigorous, improving significantly from, really, a severe lack of rigor in the previous period. The area of Population Genomics and Human Genetics, the 21st century incarnation of the Statistical Population Genetics, is one of the most exciting areas of biological and biomedical research as well as applied mathematics today. *“There have been few, if any, similar bursts of discovery in the history of medical research”* (Hunter and Kraft, 2007). In this scientific renaissance, computational and mathematical methods play an essential role. Statistical methods, the continuous mathematics component, dominate, but there is a considerable rise in importance for the algorithmic and combinatorial methods, the Discrete Mathematics component, that is not yet fully appreciated. This is a subject tof tensions and a bottleneck for enhanced rigor in the computational method approach. Practical statistical methods require exponential time; the generalizations of such methods to genome-wide size could only be obtained by algorithmic graph theory, computer science, and combinatorial methods. This course would delve deeply into these hybrid statistical-algorithmic strategies emphasizing the raise of the Discrete Mathematics component in computational methods for GWAS. These methods include graph theory algorithm, Expectation Maximization (EM), Hidden Markov Models (HMM), Markov Chain Monte Carlo (MCMC).

“Nothing in Biology Makes Sense Except in the Light of Evolution” (T.Dobzhansky). Really, no interesting problem can be solved without a mathematical evolutionary model that hypothesizes how the sampled individuals are related. Evolutionary models can be described by simple games with urn and balls.

The Polya urn games. *An urn contains r red balls and b blue balls, n=r+b. A new urn is constructed as follows. We repeat n times the following: we draw a balls at random from the first urn, put the ball back in the urn, but also put a ball of the same color in the new ball.*

Those games present simple models for gene evolution, e.g. if one assumes a probability of error of returning or putting the wrong color ball in the urns, then one obtains the “Wright-Fisher model”; or, introducing a special black ball that once drawn, it is returned, but together with a ball of new color, one obtains the “infinite allele model.”

Those games present simple models for gene evolution, e.g. if one assumes a probability of error of returning or putting the wrong color ball in the urns, then one obtains the “Wright-Fisher model”; or, introducing a special black ball that once drawn, it is returned, but together with a ball of new color, one obtains the “infinite allele model.”

The themes of special interest this year in the course include: genomics architecture of Identical-by-Descent (IBD) tracts, Li-Stephens recombination framework, Fisher theory of junctions, long-range haplotype phasing and genome-wide haplotype assembly, inference of parent-of-origin alleles, runs of homozygosity (ROH), population substructure, genetic heterogeneity, the common disease rare (and common) variants hypothesis, the “Missing Heritability” problem, models for detecting gene-gene and gene-environmental interactions. We will also discuss related topics and dilemmas such as genomic privacy dilemmas of GWAS, and the search for computer science solutions, as well as pharma's prisoner dilemma and economic behavior.

The Missing Heritability problem is one of the problems we will discuss in detail. The difficulty of designing algorithms for detecting gene–gene interactions, expected to be a big part of the solution, is well exemplified by the following puzzle due to Leslie Valiant (1985):

The light bulbs puzzle. *We have a sequence of n random light bulbs each of which is either on or off with equal probability at each time step. Further, we know that a certain pair of bulbs is positively correlated. The problem is to find efficient algorithms for recognizing the pair of light bulbs with the maximum correlation.*

Course Flyers

Lectures given by invited speakers from past years of CSCI2820 are available on the Multimedia page.