CSCI2820 Final Projects Tentative List
Each student is required to complete a final research project as part of the requirements of CSCI2820. You should work with Professor Istrail to frame your project. The final handin will include a ~15-20 minute presentation of your project results and your chosen GWAS paper, a report describing your project results, and any accompanying source code/documentation. Direct your questions regarding projects to Professor Istrail.
GWAS papers
Here are a few review papers on trends, limitations, and open problems in GWAS. These might be useful for thinking of project ideas and getting a better understanding of the state of GWAS:
- Benefits and limitations of genome-wide association studies
- A tutorial on statistical methods for population association studies
- 10 Years of GWAS Discovery: Biology, Function, and Translation
- Five Years of GWAS Discovery
- Genetic Mapping in Human Disease
Below is a set of projects and a sample of related papers aligned with the goals of the class. We are hoping these might be a source of inspiration for picking projects. Note: this represents only a sample of projects; you are encouraged to define a project aligned with your interests. Once you pick your project, please enter it on this Google Sheet by the end of the day on 19 March 2021.
Potential Final Project Topics and Relevant Supplementary Readings
- Epistatic Interactions in GWAS
- Ultrafast genome-wide scan for SNP-SNP interactions in common complex disease
- Detecting epistasis with the marginal epistasis test in genetic mapping studies of quantitative traits (L Crawford)
- SIXPAC software (from the Pe'er paper)
- Variable Selection in Deep Learning Models for GWAS
Some useful papers for variable selection in deep learning/neural networks:
- A Spike and Slab Restricted Boltzmann Machine
- Interpretable Outcome Prediction with Sparse Bayesian Neural Networks in Intensive Care
- Concrete Autoencoders for Differentiable Feature Selection and Reconstruction
- The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables
- Categorical Reparameterization with Gumbel-Softmax
- Bayesian Variable Selection and Regression Analysis in GWAS and Fine Mapping Studies
- varbvs: Fast Variable Selection for Large-scale Regression (M Stephens)
- A simple new approach to variable selection in regression, with application to genetic fine mapping (M Stephens)
- Bayesian Variable Selection Regression for Genome-Wide Association Studies and Other Large-Scale Problems
- Extending Bayesian Variable Selection Models (above) to Nonlinear Activation Functions and Deep Learning to Account for Epistasis
- Check out sections "Variable Selection in Deep Learning Models" and "Bayesian Variable Selection" above
- Gene Sets and GWAS: From selecting SNPs to selecting Genes and Pathways
- Bayesian large-scale multiple regression with summary statistics from genome-wide association studies (RSS by M Stephens)
- Analysing biological pathways in genome-wide association studies
- Pathway analysis of genomic data: concepts, methods, and prospects for future development
- Leveraging models of cell regulation and GWAS data in integrative network-based association studies
- Random forests, decision trees, and GWAS
- GWAS using Ancestral Recombination Graphs (combinatorial approach to GWAS using haplotypes)
- Coalescent-Based Association Mapping and Fine Mapping of Complex Trait Loci
- Mapping Trait Loci by Use of Inferred Ancestral Recombination Graphs
- ARGs and Haplotype Phasing
- Genealogical Trees, Coalescent Theory and the Analysis of Genetic Polymorphisms
- Mapping Trait Loci by Use of Inferred Ancestral Recombination Graphs
- Coalescent-Based Association Mapping and Fine Mapping of Complex Trait Loci
- Tag SNPs - unifying LD-select/Tagger and Informativeness - Dominating Set optimization
- Selecting a Maximally Informative Set of Single-Nucleotide Polymorphisms for Association Analyses Using Linkage Disequilibrium
- Optimal Haplotype Block-Free Selection of Tagging SNPs for Genome-Wide Association Studies
- Linkage Disequilibrium in Humans: Models and Data (Jonathan K. Pritchard and Molly Przeworski)
- Minimum informative subset selection algorithm and transferability
- Selecting a Maximally Informative Set of Single-Nucleotide Polymorphisms for Association Analyses Using Linkage Disequilibrium
- Optimal Haplotype Block-Free Selection of Tagging SNPs for Genome-Wide Association Studies
- Supplemental 1
- Supplemental 2
- The portability of tagSNPs across populations: A worldwide survey
- Codon bias in GWAS
- Selecting a Maximally Informative Set of Single-Nucleotide Polymorphisms for Association Analyses Using Linkage Disequilibrium
- Genotype Imputation with Thousands of Genomes
- The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications
- Identity by Descent in GWAS and/or computing identity-by-descent tracts in genotypes
- A Fast, Powerful Method for Detecting Identity by Descent
- Whole population, genome-wide mapping of hidden relatedness
- Rigorous algorithms for Global Maximum Likelihood Phasing, EM and generalized likelihoods
- Long-range haplotype phasing – "the power of amnesia" variable-length Markov Chain and the Browning and Browning Beagle
- Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies By Use of Localized Haplotype Clustering
- The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length
- Long-range haplotype phasing – the deCODE algorithm; haplotype sharing in closely related populations
- Cryptic population structure
- Immunogenomics
- Innate Immune and Chemically Triggered Oxidative Stress Modifies Translational Fidelity
- Comparative immunopeptidomics of humans and their pathogens
- Detecting recombination rates and the Li-Stephens framework
- Fast detection of Identical by Descent relatedness
- A Fast, Powerful Method for Detecting Identity by Descent
- Whole population, genome-wide mapping of hidden relatedness
- Parent-of-origin genetic variation
- Transferability of tagging SNPs across populations
- Generalized family and pedigree based statistical tests for association
- Crypto-GWAS and the maximum non-identifiability problem
- Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays
- The Limits of Individual Identification from Sample Allele Frequencies: Theory and Statistical Analysis
- Pedigree inference from genotypes/haplotypes
- Pedigree Reconstruction Using Identity by Descent
- Efficient maximum likelihood pedigree reconstruction
- Viral quasispecies reconstruction and polyploid haplotype assembly
- QColors: An algorithm for conservative viral quasispecies reconstruction from short and non-contiguous next generation sequencing reads
- Haplotype assembly in polyploid genomes and identical by descent shared tracts
- Haplotype and genome assembly of metagenomes