The Regulatory Genome, Gene Regulatory Networks and Transcriptomics
Research Summary
Eric Davidson's Regulatory Genome for Computer Science: Causality, Logic, and Proof Principles of the Genomic cis-Regulatory Code
In this article, we discuss several computer science problems inspired by our 15-year-long collaboration with Prof. Eric Davidson, focusing on computer science contributions to the study of the regulatory genome. Our joint study was inspired by his lifelong, trailblazing research program rooted in causal gene regulatory networks (GRNs), system completeness, genomic Boolean logic, and genomically encoded regulatory information. We first present four inspiring questions that Eric Davidson asked, and then the follow-up: seven technical problems, fully or partially resolved with the methods of computer science. At the center of those technical challenges, unifying their intellectual backbone, stands “Causality.” Our collaboration produced the causality-inferred cisGRN-Lexicon database, containing the cis-regulatory architecture (CRA) of 600+ transcription factor (TF)-encoding genes and other regulatory genes, in eight species: human, mouse, fruit fly, sea urchin, nematode, rat, chicken, and zebrafish. These CRAs are causality-inferred regulatory regions of genes, derived through the experimental method called “cis-regulatory analysis” (also known as the “Davidson criteria”).
In this research program, causality challenges for computer science show up in two components: (1) how to define data structures that represent DNA structure data whose causality has been inferred by the Davidson criteria, and how to define a versatile software system to host them; and (2) how to identify, with automated text-analysis software, the experimental technical articles that apply the Davidson criteria to the analysis of genes. We next present the cisGRN-Lexicon Meta-Analysis (Part I). We conclude the article with some reflections on epistemological and philosophical themes concerning the role of causality, logic, and proof in the emerging elegant mathematical theory and practice of the regulatory genome.
It is challenging to explain what “explanation” is, and to understand what “understanding” is, when the technical task is to “prove” system-level causality completeness of a 50-gene causal GRN. Within the Peter-Davidson Boolean GRN model, the Peter-Davidson completeness “theorem” provides a seminal answer: Experimental causality system completeness = Computational exact prediction completeness.
The article is organized as follows. Section 2 is dedicated to Prof. Eric Davidson. Section 3 gives a brief introduction, for computer scientists, to the regulatory genome and its information processing operations, in terms analogous to those of the electronic computer. Section 4 proposes to honor Eric Davidson's life-long scientific work on the regulatory genome by naming a most fundamental time unit constant after him. Section 5 presents four grand challenge questions that Eric Davidson asked, and seven follow-up problems inspired by the first two questions, which we fully or partially solved together. Central to these solutions is our construction of the cisGRN-Lexicon, the database of causally inferred CRAs of 600+ regulatory genes in eight species. Section 6 presents Part I of the cisGRN-Lexicon Meta-Analysis, couched as “rules” of the genomic cis-regulatory code. Section 7 is devoted to reflections on epistemological and philosophical themes: causality, logic, and proof in the elegant mathematical modeling of the regulatory genome. We present here the “Davidsonian Causal Systems Biology Axioms,” which guide us toward an understanding of what it means to “prove” causality completeness, for a complex experimental system, by exact computational predictions.
How Does the Regulatory Genome Work?
The regulatory genome controls genome activity throughout the life of an organism. This requires that complex information processing functions are encoded in, and operated by, the regulatory genome. Although much remains to be learned about how the regulatory genome works, we here discuss two cases where regulatory functions have been experimentally dissected in great detail and at the systems level, and formalized by computational logic models. Both examples derive from the sea urchin embryo, but they address two distinct organizational levels of genomic information processing. The first example shows how the regulatory system of a single gene, endo16, executes logic operations through individual transcription factor binding sites and cis-regulatory modules that control the expression of this gene. The second example shows information processing at the gene regulatory network (GRN) level. The GRN controlling development of the sea urchin endomesoderm has been experimentally explored almost to completeness. A Boolean logic model of this GRN suggests that the modular logic functions encoded at the single-gene level show compositionality and suffice to account for integrated function at the network level. We discuss these examples both from a biological-experimental point of view and from a computer science-informational point of view, as both illuminate principles of how the regulatory genome works.
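To make the idea of a Boolean logic model concrete for computer scientists, here is a minimal sketch of a synchronous Boolean network. It is not the published endo16 or endomesoderm model: the four gene names, their rules, and the synchronous update scheme are all hypothetical and chosen only to show how per-gene logic functions compose into network-level behavior.

```python
from typing import Callable, Dict, List

State = Dict[str, bool]

# Hypothetical toy network (NOT the sea urchin endomesoderm GRN).
# Each gene's rule is a Boolean function of the previous network state.
RULES: Dict[str, Callable[[State], bool]] = {
    "signal": lambda s: s["signal"],  # external input, held constant
    "activator": lambda s: s["signal"],  # switched on by the input
    "repressor": lambda s: s["activator"],  # switched on one step later
    "target": lambda s: s["activator"] and not s["repressor"],  # AND-NOT logic
}


def step(state: State) -> State:
    """Synchronous update: every gene reads the same previous state."""
    return {gene: rule(state) for gene, rule in RULES.items()}


def simulate(initial: State, n_steps: int) -> List[State]:
    """Return the trajectory of states, starting with the initial state."""
    trajectory = [initial]
    for _ in range(n_steps):
        trajectory.append(step(trajectory[-1]))
    return trajectory


if __name__ == "__main__":
    start = {"signal": True, "activator": False, "repressor": False, "target": False}
    for t, s in enumerate(simulate(start, 3)):
        print(t, s)
```

In this toy network the `target` gene is expressed for a single time step: it turns on once `activator` is present and turns off when the delayed `repressor` arrives. Comparing such a computed expression pattern with the observed one is the kind of check that the abstract describes at the level of the full network.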
OCR-based image features for biomedical image and article classification: identifying documents relevant to cis-regulatory elements
Images form a significant and useful source of information in published biomedical articles, which is still under-utilized in biomedical document classification and retrieval. Much current work on biomedical image retrieval and classification employs simple, standard image features such as gray scale histograms and edge direction to represent and classify images. We have used such features as well to classify images in our early work [5], where we used image-class-tags to represent and classify articles.
In the work presented here we focus on a different literature classification task, motivated by the need to identify articles discussing cis-regulatory elements and modules in the context of understanding complex gene networks. The curators who try to identify such articles in the vast literature use as a major cue a certain type of image in which the conserved cis-regulatory region on the DNA is shown. Our experiments show that automatically identifying such images using common image features (like those mentioned above) can be highly error-prone. However, using Optical Character Recognition (OCR) to extract alphabetic characters from images, calculating the character distribution, and using the distribution parameters as image features allows us to form a novel representation of images and to identify DNA content in images with high precision and recall (over 0.9). Utilizing the occurrence of such DNA-rich images within articles, we train a classifier that identifies articles pertaining to cis-regulatory elements with similarly high precision and recall. The use of OCR-based image features has much potential beyond the current task, for identifying other types of biomedical sequence-based images showing DNA, RNA, and proteins. Moreover, the ability to automatically identify such images could be widely applicable in other important biomedical document classification tasks.
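The character-distribution idea can be illustrated with a short sketch. It assumes the OCR step has already produced a text string for an image, so no OCR library is called. The feature names, the function names, and the simple threshold rule are hypothetical stand-ins; the paper trains a classifier on distribution parameters, and its exact features are not given here.

```python
from collections import Counter
from typing import Dict

DNA_LETTERS = "ACGT"


def char_distribution_features(ocr_text: str) -> Dict[str, float]:
    """Fraction of each DNA letter among the alphabetic characters in OCR output."""
    letters = [c for c in ocr_text.upper() if c.isalpha()]
    total = len(letters)
    counts = Counter(letters)
    features = {
        f"frac_{base}": (counts[base] / total if total else 0.0)
        for base in DNA_LETTERS
    }
    # Combined share of A, C, G, and T among all letters.
    features["frac_dna"] = sum(features[f"frac_{base}"] for base in DNA_LETTERS)
    features["n_letters"] = float(total)
    return features


def looks_dna_rich(ocr_text: str, min_letters: int = 20, min_frac: float = 0.9) -> bool:
    """Illustrative threshold rule (an assumption, not the paper's trained classifier)."""
    f = char_distribution_features(ocr_text)
    return f["n_letters"] >= min_letters and f["frac_dna"] >= min_frac


if __name__ == "__main__":
    print(looks_dna_rich("ACGTTGCA" * 5))
    print(looks_dna_rich("Figure 3. Expression of the reporter construct in embryos"))
```

In practice the feature dictionary would be passed to a trained classifier rather than a fixed threshold; the thresholds above (`min_letters`, `min_frac`) are arbitrary illustrative values.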
Relevant Papers
-
OCR-based Image Features for Biomedical Image and Article Classification: Identifying Documents Relevant to Cis-Regulatory Elements
2012. Hagit Shatkay, Ramya Narayanaswamy, Santosh Nagaral, Na Harrington, Dorothea Blostein, Ryan Tarpine, Kyle Schutter, Rohith Mv, Gowri Somanath, Sorin Istrail, Chandra Kambhamettu
-
The Genome of the Sea Urchin Strongylocentrotus purpuratus
2006. Sea Urchin Genome Sequencing Consortium, Erica Sodergren, George M. Weinstock, Eric H Davidson, R. Andrew Cameron, Richard A. Gibbs, Robert C. Angerer, Lynne M. Angerer, Maria Ina Arnone, David R. Burgess, Robert D. Burke, James A. Coffman, Michael Dean, Maurice R. Elphick, Charles A. Ettensohn, Kathy R. Foltz, Amro Hamdoun, Richard O. Hynes, William H. Klein, William Marzluff, David R. McClay, Robert L. Morris, Arcady Mushegian, Jonathan P. Rast, L. Courtney Smith, Michael C. Thorndyke, Victor D. Vacquier, Gary M. Wessel, Greg Wray, Lan Zhang, Christine G. Elsik, Olga Ermolaeva, Wratko Hlavina, Gretchen Hofmann, Paul Kitts, Melissa J. Landrum, Aaron J. Mackey, Donna Maglott, Georgia Panopoulou, Albert J. Poustka, Kim Pruitt, Victor Sapojnikov, Xingzhi Song, Alexandre Souvorov, Victor Solovyev, Zheng Wei, Charles A. Whittaker, Kim Worley, K. James Durbin, Yufeng Shen, Olivier Fedrigo, David Garfield, Ralph Haygood, Alexander Primus, Rahul Satija, Tonya Severson, Manuel L. Gonzalez-Garay, Andrew R. Jackson, Aleksandar Milosavljevic, Mark Tong, Christopher E. Killian, Brian T. Livingston, Fred H. Wilt, Nikki Adams, Robert Bell, Seth Carbonneau, Rocky Cheung, Patrick Cormier, Bertrand Cosson, Jenifer Croce, Antonio Fernandez-Guerra, Anne-Marie Genevière, Manisha Goel, Hemant Kelkar, Julia Morales, Odile Mulner-Lorillon, Anthony J. Robertson, Jared V. Goldstone, Bryan Cole, David Epel, Bert Gold, Mark E. Hahn, Meredith Howard-Ashby, Mark Scally, John J. Stegeman, Erin L. Allgood, Jonah Cool, Kyle M. Judkins, Shawn S. McCafferty, Ashlan M. Musante, Robert A. Obar, Amanda P. Rawson, Blair J. Rossetti, Ian R. Gibbons, Matthew P. Hoffman, Andrew Leone, Sorin Istrail, Stefan C. Materna, Manoj P. Samanta, Viktor Stolc, et al.
-
The Sequence of the Human Genome
2001. J. Craig Venter, Mark D. Adams, Eugene W. Myers, Peter W. Li, Richard J. Mural, Granger G. Sutton, Hamilton O. Smith, Mark Yandell, Cheryl A. Evans, Robert A. Holt, Jeannine D. Gocayne, Peter Amanatides, Richard M. Ballew, Daniel H. Huson, Jennifer Russo Wortman, Qing Zhang, Chinnappa D. Kodira, Xiangqun H. Zheng, Lin Chen, Marian Skupski, Gangadharan Subramanian, Paul D. Thomas, Jinghui Zhang, George L. Gabor Miklos, Catherine Nelson, Samuel Broder, Andrew G. Clark, Joe Nadeau, Victor A. McKusick, Norton Zinder, Arnold J. Levine, Richard J. Roberts, Mel Simon, Carolyn Slayman, Michael Hunkapiller, Randall Bolanos, Arthur Delcher, Ian Dew, Daniel Fasulo, Michael Flanigan, Liliana Florea, Aaron Halpern, Sridhar Hannenhalli, Saul Kravitz, Samuel Levy, Clark Mobarry, Knut Reinert, Karin Remington, Jane Abu-Threideh, Ellen Beasley, Kendra Biddick, Vivien Bonazzi, Rhonda Brandon, Michele Cargill, Ishwar Chandramouliswaran, Rosane Charlab, Kabir Chaturvedi, Zuoming Deng, Valentina Di Francesco, Patrick Dunn, Karen Eilbeck, Carlos Evangelista, Andrei E. Gabrielian, Weiniu Gan, Wangmao Ge, Fangcheng Gong, Zhiping Gu, Ping Guan, Thomas J. Heiman, Maureen E. Higgins, Rui-Ru Ji, Zhaoxi Ke, Karen A. Ketchum, Zhongwu Lai, Yiding Lei, Zhenya Li, Jiayin Li, Yong Liang, Xiaoying Lin, Fu Lu, Gennady V. Merkulov, Natalia Milshina, Helen M. Moore, Ashwinikumar K Naik, Vaibhav A. Narayan, Beena Neelam, Deborah Nusskern, Douglas B. Rusch, Steven Salzberg, Wei Shao, Bixiong Shue, Jingtao Sun, Zhen Yuan Wang, Aihui Wang, Xin Wang, Jian Wang, Ming-Hui Wei, Ron Wides, Chunlin Xiao, Chunhua Yan, et al.
-
Visualization challenges for a new cyberpharmaceutical computing paradigm
2001. Russell J. Turner, Kabir Chaturvedi, Nathan J. Edwards, Daniel Fasulo, Aaron L. Halpern, Daniel H. Huson, Oliver Kohlbacher, Jason R. Miller, Knut Reinert, Karin A. Remington, Russell Schwartz, Brian Walenz, Shibu Yooseph, Sorin Istrail