CSCI2951-N: Advanced Algorithms in Computational Biology

====================================== ======================================

An overview of class activities from the class of Fall 2015

Class Activities - Final Projects and Presentations
-----------------------------------------------------
November 25 - December 8: - Homwework 4 * 2

November 25: - 1-page description of the Final Project due and The choice of the GWAS disease area/paper presentation (using "Guilt-by-Association" detective metaphors to convey the GWAS study findings)

-----------------------------------------------------

December 9: Genome-Wide Association Studies (GWAS) presentations (2:30-4:00pm in the SWIG)

Theme: "Guilt-by-Association" -- detective/police-like terminology/metaphors for types of evidence for the "guilt" in the search for genetic determinants of disease

Dilum Aluthge: "TBA (Asthma)"

Alex Constantino: "The Mystery of the Split Mind" (Schizophrenia)

Sam Angelo Crisanto: "Nature vs. Nurture: Tracking Down the Genetic Risk Factors for Obesity"

Daniel Seidman: "Asthma is Like a Box of Chocolates"

Ben Siranosian: "An association so strong it'll stop your heart: Searching for SNPs in atrial fibrillation"

-----------------------------------------------------

December 17: Noon-2:00pm in the SWIG (Special lecture)

Hammurabi Mendes (Department of Computer Science, Brown Univeristy)

What Voting Theory can tell Protein Folding Algorithm Designers?

Abstract: Protein Folding is concerned with the reasons and mechanism behind a protein's tertiary structure. The thermodynamic hypothesis of Anfinsen postulates an universal energy function (UEF) characterizing the tertiary structure, defined consistently across proteins, in terms of their aminoacid sequence. We consider the approach of examining multiple protein structure descriptors in the PDB (Protein Data Bank), and infer individual preferences, biases favoring particular classes of aminoacid interactions in each of them, later aggregating these individual preferences into a global preference. We view the interactions between amino acids as "voting" information, and discuss the viability of a mechanism generating the aggregate global preference - which can potentially give insights into the postulated UEF. Aggregating individual preferences in a global preference is one of the concerns of Social Choice Theory. In 1951, Kenneth Arrow stated his now famous impossibility theorem which defines apparently innocuous properties that functions that aggregate preferences, called social welfare functions, should respect. The theorem shows that these properties, discussed in the following section, were conflicting. Many other concepts, assumptions, and impossibility results followed. In this talk, we consider concepts and impossibility results from voting theory and unveil methodological difficulties for the aggregation mechanism mentioned above. We highlight how key theoretical barriers, already exposed by economists, can be relevant for the development of new methods and algorithms for problems related to protein folding.

-----------------------------------------------------

December 19: Final Projects Presentations (Noon-3:00pm in the SWIG)

Dilum Aluthge: "Towards an Information-Theoretic Measure of Linkage Disequilibrium"

Alex Constantino: "Toward Faster Phasing"

Sam Crisanto: "Variable-Memory Algorithms: How far into the past should we look in order to predict the future?"

Daniel Seidman: "Lumbertract: Haplotype Phasing and Identical by Descent Tracts"

Ben Siranosian "From the atomic bomb to aging cells: Applying MCMC and the Metropolis algorithm to chromatin conformation"
-----------------------------------------------------

Class 23 (December 4)
Heritability and Genetic Heterogeneinty
The Heritability equations
The Search for Meaning in GWAS and the seminal paper of Jon McClellan and Mary-Claire King "Genetic Heritability in Human Disease", CEll, Leading Edge Essay, No. 141, April 16, 2010

Class 22 (December 2)
The Coalescent and the Coalescent with Recombination: Analytical Theory
Simulating the Coalescent (Kingman);
Simulating the Coalescent with Recombination (Hudson)

Class 21 (November 25)
The Coalescent (continued)
Wright-Fisher Models Poisson approximation and estimation of the most recent common ancestor (MRCA) population size
The Coalescent with Recombination

Homework 4 available; Due December 8, 2014.

Class 20 (November 20)
The Minichiello-Durbin Algorithm (continued)
Shared Tracts in haplotypes
The most case-controls discriminating Marginal Tree of an ARG
The Coalescent with Recombination

Class 19 (November 18)
The Coalescent
The Ancestral Recombination Graph (ARG) and Marginal Trees
The Minichiello-Durbin Algorithm (inferring genealogies having mutations, recombinations and coalescence)

Class 18 (November 13)
The STRUCTURE Algorithm: Independence assumptions (LE and HWE), the Bayesian posterior distribution approach, models withtout admixture and with admixture, the algorithm, using the Law of Large Numbers to sample the space; a short intro to the Propp-Wilson "coupling from the past"; the Ising Model of ferromagnetism

Class 17 (November 11)
The Metropolis Algorithm; an example the Hard-Core model of statistical physics; The Law of Large Numbers and Markov Chain Monte Carlo sampling for independent runs and summary statistics

Class 16 (November 6)
Markov Chain Theory: Stationary distributions, the existence and uniqueness theorem on the stationary distributions, the total variation distance, the convergence theorem, reversible Markov Chains, random walks on graphs, Markov Chain Monte Carlo simulations.

Class 15 (November 4)
The Population Substructure Problem - an introduction
The Markov Chains Monte Carlo Method
The STRUCTURE Algorithm of Jonathan Pritchard, Mattew Stephens and Peter Donnelly (2000)

Homework 3 available; Due November 12, 2014.

HW3 will be given tomorrow Friday October 31 midnight instead of Thursday night (forgot my laptop batery during class in the SWIG - apologies). Class 12 on Oct 28 contains materials that are part of homework readings and preparations regarding the "Adopt a Disease/GWAS" and make a class presentation. They materials are available on the class webpage.

Class 14 (October 30)
The Long-Range Haplotype Phasing Algorithm - continued (fine scale recombination inference; phasing untyped people; de novo mutations)
The Transmission Disequilibrium Test (TDT)
Herbert Simon's "Hora and Tempus" puzzle: A metaphor about Evolution

Class 13 (October 28)
The Long-Range Haplotype Phasing Algorithm

Class 12 (October 23)
Genome-Wide Association Studies (GWAS) - Computational Workflows - Rare Variants
Adopt a disease/GWAS - Preparing for a GWAS presentation
Guilt by association - A "police detective story" view/editorial about the search for meaning in GWAS: revealing evidence about genomic variants confering disease risk

Next class on October 21 is cancelled (Professor attending the American Society for Human Genetics annual meeting in San Diego). The class is postponed for later in the semester (time TBA).

Class 11 (October 16)
Genome-Wide Association Studies (GWAS) - Computational Workflows - Common Variants

Class 10 (October 14)
The Computational Complexity of the mutually reducible to each other problems: the Set Cover Problem and the Dominating Set Problem; the two problems capture, respectively, the optimization problems of the Informativeness and LD-select/Tagger tagging SNPs algorithms

Homework 2 available; Due October 21, 2014.

Class 9 (October 9)
Tagging SNPs: The Informativeness Algorithm

Class 8 (October 7)
Tagging SNPs: The LD Select Algorithm

Class 7 (October 2)
The Hubbell Theorems: Parsimony Haplotype Phasing by is NP-hard, and, Maximum Likelihood Haplotype Phasing is NP-Hard

Class 6 (September 30)
Readings: 1) The Seminal paper of Excoffier-Slatkin on the Expectation-Maximization algorithm for computing haplotype frequencies and haplotype phasing.

Class 5 (September 18)
Readings: 1) An article containing an elementary proof of the Ewens Sampling Lemma. It also contains some generalizations of the Lemma. 2) An article containing applications of the Ewens Sampling Lemma.

Next two classes on September 23 and 25 are cancelled (Professor attending a conference at Berkeley U). They are postponed for later in the semester (time TBA).

Homework 1 available; Due September 30, 2014.

Class 4 (September 16) materials available now.
Readings: 1) Andy Clark's seminal 1990 paper introducing his method for haplotype phasing, i.e. inference of haplotypes from PCR-amplified samples of diploid populations.
2) D. Gusfield's paper about graph theoretical formulation of the haplotype phasing problem and his theorem that computing the optimal solution in the Clark sense of parsimony, namely "phase the most genotypes" is NP-complete.

Class 3 (September 11) materials available now.
Readings: 1) A comparison paper of LD measures; power and caveats on the problem of measuring LD. 2) Classical statistical hypothesis testing procedure: an in depth presentation of rigorous steps usually skipped in textbooks. 3) Sir Ronald Fisher's famous puzzle "The Lady Tasting Tea" - try to solve it.

Class 2 (September 9) materials available now.
Readings: 1) R. Lewontin's seminal paper introducing D. It is good to read the classic papers. 2) M. Slatkin's review paper about Linkage Disequilibrium. We will come back during the semester with lecturs on topics covered by the review. It is a good technical read to get a first glance at the broad range of applications of linkage disequilibrium.

Class 1 (Septenmber 4) materials available now, see Schedule link.

course description

This is a full-lecture, graduate course on algorithms and biomedical applications. The Foundations lectures are an introduction to the biological and medical genomics application areas. Each Algorithm section is devoted to an algorithmic method presented in rigorous depth, followed by an important open problem in the application area, together with the current most effective algorithmic solutions to the problem. Graduate students and advanced undergraduates in computational and mathematical sciences and engineering are welcome. Biological, life sciences and medical students and faculty are welcome as well and will be able to participate more in the applications areas. Medical applications areas: autism and mental disease, HIV and preterm birth.

Biology background is not required; biological motivation and background for the problem areas covered in class will be presented in class and further expanded through readings. Although the ideal preparation for students for taking this class would be CSCI2820, computer science and mathematical sciences and engineering graduate students and advanced undergraduates are welcome even if new to the area. Please talk to the professor if unsure if your background is appropriate.

Genomics Foundations: Modeling and Measuring Evolution: Linkage Disequilibrium (LD), Urn Models

Lectures in this section focus on DNA and protein genomics data and evolutionary models. Fundamental to understanding genomic data is the study of its empirical patterns of genetic variation as measured by Linkage Disequilibrium (LD). As "Nothing in Biology Makes Sense Except in the Light of Evolution" (Theodosius Dobzhansky), several mathematical models of evolution will be introduced. Here we will study the inspiring work of Sir Ronald Aylmer Fisher. Urn models are a statistical framework to define evolutionary models. An example. We start with an urn U0 containing n balls, k of them white balls and n-k black balls; one random ball b is extracted from the urn, an a balls of the same color as b is placed in another urn U1, while the ball b is returned in the first urn. We do this n times. Now urn U1 has n balls as well. And so on. As the process is probabilistic, one can study various properties of this stochastic process. Think of balls as genes, the content of an urn as a population, and the the process as transmitting genes to the next generation. Various generalizations of this game model essential components of the evolutionary process.

Genome-Wide Association Studies (GWAS): Statistical associations, the missing heritabililty problem, genetic heterogeneity, genomic privacy.
Algorithms: Maximum Likelihood and Expectation-Maximization Algorithms: Inferring haplotype frequencies in populations.
Algorithms: Set-cover and Minimum Informative Subset Algorithms: Tagging SNPs selection, LD.
Algorithms: Markov Chain Monte Carlo Algorithms: Population Substructure.
Algorithms: Statistical Hypothesis Testing and Knapsack Algorithms: The Neyman-Pearson Lemma, Multiple Testing.
Algorithms: The Social Network of Protein Folds: Individual Preferences of Amino Acids and the Thermodynamic "Social Choice" Hypothesis. Voting Theory Algorithms and von Neumann-Morgenstern Utility Theory: Protein Folding Energy Function Inference.

contact

professor: Sorin Istrail
office: CIT 523
email: sorin@cs.brown.edu
office hours: Monday 4:00-5:00pm or by appointment in my office

CSCI2951-N: Advanced Algorithms in Computational Biology

Fall 2016, Tuesday/Thursday2:30pm-3:50pm in CIT 241

An overview of class activities from the class of Fall 2015

Class Activities - Final Projects and Presentations

December 9: Genome-Wide Association Studies (GWAS) presentations (2:30-4:00pm in the SWIG)

December 17: Noon-2:00pm in the SWIG (Special lecture) Hammurabi Mendes (Department of Computer Science, Brown Univeristy)

course description

Genomics Foundations: Modeling and Measuring Evolution: Linkage Disequilibrium (LD), Urn Models

Genome-Wide Association Studies (GWAS): Statistical associations, the missing heritabililty problem, genetic heterogeneity, genomic privacy.

Algorithms: Maximum Likelihood and Expectation-Maximization Algorithms: Inferring haplotype frequencies in populations.

Algorithms: Set-cover and Minimum Informative Subset Algorithms: Tagging SNPs selection, LD.

Algorithms: Markov Chain Monte Carlo Algorithms: Population Substructure.

Algorithms: Statistical Hypothesis Testing and Knapsack Algorithms: The Neyman-Pearson Lemma, Multiple Testing.

Algorithms: The Social Network of Protein Folds: Individual Preferences of Amino Acids and the Thermodynamic "Social Choice" Hypothesis. Voting Theory Algorithms and von Neumann-Morgenstern Utility Theory: Protein Folding Energy Function Inference.

contact

Fall 2016, Tuesday/Thursday
2:30pm-3:50pm in CIT 241

December 17: Noon-2:00pm in the SWIG (Special lecture)

Hammurabi Mendes (Department of Computer Science, Brown Univeristy)