Note: The field of statistical genetics continues to grow rapidly, and new technologies and ideas are constantly opening up new research areas. The projects below are therefore proposals only: we will select the most timely, relevant, and meaningful research for completion during summer 2014, and the final list may change between now and the start of the program.
Project Area #1. Human statistical genetics research background and aims (project topics)
The technological and computational breakthroughs in the years since the sequencing of the human genome have provided an unprecedented opportunity to understand the etiology of complex human diseases. Notably, the diminishing cost of next-generation sequencing means that it is now possible for researchers to obtain complete genome sequence information on many thousands of individuals. However, major statistical questions remain about the optimal design and analysis of studies that use next-generation sequencing data to assess the contribution of rare variation to common diseases. At the foundation of many such questions is the lack of power of single-marker tests of association for rare variants, which has motivated the development of many potentially more powerful variant-set tests that aggregate evidence from many individual variants into a single test statistic. Our group has developed a framework for evaluating the performance of existing variant-set tests on moderately sized variant sets (fewer than 100 variants). We then used this framework to clarify test performance in a variety of circumstances, to develop novel robust and powerful tests, to evaluate method performance in light of genotype uncertainty, to develop methods that characterize underlying genetic architecture more precisely, and to apply these methods to better understand the genetics of fatty acids and high blood pressure. Continued technological innovation and falling costs are driving rapid increases in sample size, in the proportion of the genome that is sequenced, and in sequence coverage, all of which increase the number of variants being analyzed. However, mounting evidence suggests that many widely used variant-set methods perform poorly as the number of variants increases into the hundreds and thousands.
Current and newly proposed variant-set tests that attempt to address large variant sets differ in how they combine and weight variants, leading to poorly understood differences in performance under different genetic models. Regardless of which tests emerge as optimal in the presence of large numbers of genetic variants, several challenges will remain in applying these methods to real, imperfect data and in inferring the underlying genetic architecture from a statistically significant test result. Our research will begin by developing a deeper understanding of the behavior of variant-set tests in the presence of large numbers of variants. It will then address the realistic application of these tests and the development of methods that decompose significant test statistics into information that can guide future studies. We expect this to lead to a variety of novel approaches and a better understanding of why certain methods perform the way they do. This work will provide a critical step towards successfully identifying risk variants in future sequencing experiments.
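For readers new to variant-set tests, the idea of aggregating evidence from many variants into a single test statistic can be illustrated with a simple burden-style test. The sketch below is a minimal illustration, not one of the tests developed by our group: the function names, the equal default weights, the toy genotype matrix, and the use of a permutation p-value are all assumptions made for this example.

```python
import random

def burden_scores(genotypes, weights=None):
    """Collapse each individual's minor-allele counts (0, 1 or 2 per variant)
    into a single weighted sum across the variant set."""
    n_variants = len(genotypes[0])
    if weights is None:
        weights = [1.0] * n_variants  # assumption: all variants weighted equally
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]

def burden_statistic(scores, is_case):
    """Difference in mean burden score between cases and controls."""
    cases = [s for s, c in zip(scores, is_case) if c]
    controls = [s for s, c in zip(scores, is_case) if not c]
    return sum(cases) / len(cases) - sum(controls) / len(controls)

def permutation_p_value(genotypes, is_case, n_perm=2000, seed=0):
    """Two-sided permutation p-value obtained by shuffling case/control labels."""
    rng = random.Random(seed)
    scores = burden_scores(genotypes)
    observed = burden_statistic(scores, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if abs(burden_statistic(scores, labels)) >= abs(observed):
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical toy data: 6 individuals (rows) by 4 rare variants (columns).
genotypes = [
    [1, 0, 0, 1],  # case
    [0, 1, 0, 0],  # case
    [0, 0, 1, 0],  # case
    [0, 0, 0, 0],  # control
    [0, 0, 0, 0],  # control
    [1, 0, 0, 0],  # control
]
is_case = [True, True, True, False, False, False]

observed, p_value = permutation_p_value(genotypes, is_case)
print(observed, p_value)
```

Different variant-set tests differ mainly in how the per-variant evidence is weighted and combined. In this sketch, that choice corresponds to the `weights` argument and to the form of `burden_statistic`.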
Aim 1: Develop a framework to understand the behavior of variant-set tests. We propose a framework for genetic sequence data that relates differences in variation between cases and controls to differences in allele frequency vectors. The framework directly relates parameters in the genetic disease model to test performance, giving analytic insight into the behavior of existing tests. We anticipate that this framework will be a valuable tool for prospective analyses of existing and novel test statistics, avoiding laborious simulations and allowing testing strategies to be optimized based on prior assumptions about particular genetic models.
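To make the term "allele frequency vector" concrete, the sketch below computes one such vector for cases and one for controls, then takes their difference. It shows only the kind of quantity on which the framework operates, not the framework itself. The function name and the toy genotypes are hypothetical, and genotypes are assumed to be coded as minor-allele counts (0, 1 or 2) in diploid individuals.

```python
def allele_frequencies(genotypes):
    """Per-variant minor-allele frequency for a group of diploid individuals.

    genotypes: list of rows, one per individual, each holding the
    minor-allele count (0, 1 or 2) at every variant in the set.
    """
    n_individuals = len(genotypes)
    n_variants = len(genotypes[0])
    return [
        sum(row[j] for row in genotypes) / (2 * n_individuals)
        for j in range(n_variants)
    ]

# Hypothetical toy data: 4 cases and 4 controls typed at 3 variants.
case_genotypes = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0]]
control_genotypes = [[0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0]]

p_case = allele_frequencies(case_genotypes)
p_control = allele_frequencies(control_genotypes)

# Element-wise difference between the two allele frequency vectors.
difference = [a - b for a, b in zip(p_case, p_control)]
print(p_case, p_control, difference)
```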
Aim 2: Evaluate variant-set tests in the presence of imperfect data. Most variant-set tests have been evaluated on simulated sequence data that assume perfect genotypes, perfect phenotypes and well-defined sets of variants. In reality, all of these data are subject to uncertainty. We will develop comprehensive models of data uncertainty, and then use analytic and simulation methods to incorporate these models into the framework developed in Aim 1, leading to optimal and novel approaches to testing and informed study design.
Aim 3: Develop post-hoc analyses to identify causal variants and inform replication study design. Following a statistically significant variant-set association, post-hoc methods are required to decompose the test statistic to extract crucial information for designing replication studies and inferring underlying genetic models. We will develop novel methods to unravel variant-set test statistics to estimate genetic architecture at the locus and utilize this information to determine cost-effective replication designs.
Project Area #2. Bacterial statistical genetics research background and aims (project topics)
Systems-level understanding of microbial life requires truly integrated data, methods and tools that will enable researchers to approach heretofore unapproachable problems – transformational problems in the areas of health, environment, energy and food as outlined in the 2009 National Research Council report, “A New Biology for the 21st Century” – but also questions of fundamental biological significance that underpin each of these problem areas. What is the functional diversity that exists in the microbial world? How do those functions interact to produce the ecosystems that we observe? What drives these interactions? How did these systems come to be – to evolve? One critical step toward answering these types of questions is to acquire a solid, predictive understanding of the metabolic functions and regulatory strategies of microbes and microbial communities. Equally important is the training of the next generation of scientists, mathematicians, computer scientists, and engineers to be fluent in systems-level approaches to problems, the “New Biology”. The PIs have a successful track record in integrative approaches to systems biology, which uniquely positions them to make significant advances in the integration of data types to address these fundamental questions while simultaneously training undergraduate students at all levels in interdisciplinary systems science. Three developments position us to approach a predictive understanding of the metabolic and regulatory features of microbes:
- Recent breakthroughs in next-generation sequencing (NGS) technology are providing unprecedented access to genomic and transcriptomic data across diverse bacterial species far beyond the traditional, well-understood model organisms.
- Over the past five years, there have been significant advances in the development and understanding of genome-wide metabolic networks for bacterial organisms. In particular, the field of metabolic modeling has moved from mostly manual development of metabolic networks (Francke, C. et al. 2005) to semi-automated and fully automated approaches, which are being applied to thousands of sequenced bacterial genomes through our earlier efforts (see below). These efforts have leveraged the subsystems approach to genome annotation as implemented in the SEED (Ross Overbeek et al. 2005) and the rapid, accurate annotation of new microbial genome sequences through the RAST (Aziz et al. 2008), to map genome annotations to biochemical reactions and automatically build metabolic networks. These networks serve as the foundation for steady-state metabolic modeling techniques such as Flux Balance Analysis (FBA), which use linear programming to analyze the flow of metabolites through a metabolic network (Orth & B. Ø. Palsson 2011). As a result of our work on development of the ModelSEED (C. S. Henry et al. 2010; Matthew DeJongh et al. 2007), we have now generated metabolic models (MMs) for over 3000 microbial organisms annotated using the SEED/RAST system. This serves as a valuable resource for the modeling community to explore properties of metabolic networks and interactions between microbes (Freilich et al. 2011).
- In parallel, there have been significant advances in the characterization of transcriptional regulatory networks (TRNs) on a genome-wide scale for sequenced microbes. Methods to reconstruct the TRN for an organism have used two basic approaches: (1) the identification of regulatory binding motifs for specific transcription factors (TFs) in an organism to identify potential TF-target gene relationships (D. A. Rodionov 2007) and (2) the discovery of TF-target relationships through the statistical analysis (machine learning) of genome-wide gene expression data (e.g., Faith et al. 2007). TRNs are being systematically predicted for major bacterial groups using the first approach (e.g., D. Rodionov et al. 2011; D. Ravcheev et al. 2011) and made available through databases such as RegPrecise (P. Novichkov, Laikova, et al. 2010).
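The linear program behind Flux Balance Analysis, mentioned in the second development above, can be written in its standard textbook form. The notation is introduced here for illustration and is not taken from the cited work: $S$ is the stoichiometric matrix (metabolites by reactions), $v$ is the vector of reaction fluxes, $c$ holds the objective weights (for example, a single 1 on a biomass reaction), and $v^{\min}$ and $v^{\max}$ are lower and upper flux bounds.

```latex
\begin{aligned}
\max_{v} \quad & c^{\top} v \\
\text{subject to} \quad & S v = 0, \\
& v^{\min}_{j} \le v_{j} \le v^{\max}_{j} \quad \text{for each reaction } j .
\end{aligned}
```

The constraint $S v = 0$ encodes the steady-state assumption that no internal metabolite accumulates or is depleted. The bounds encode reaction reversibility and the nutrients available in a given environment.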
One of the greatest challenges in the field of metabolic modeling is that metabolic models must capture gene regulatory information in order to model the metabolic response of an organism to its environment more accurately (Lewis et al. 2012). Moreover, this integration ultimately needs to be implemented at scale, producing integrated models for thousands of organisms.
In this proposal, we outline a series of methodological advances that will (1) effectively combine these three data types (metabolic models, TRNs, and gene expression data) into integrated metabolic regulatory models (iMRMs) and (2) provide applications that will aid researchers in the systematic exploration of metabolic and regulatory functions of microbes.
Aim 1. We will develop new and improved integrated metabolic-regulatory models (iMRMs) by addressing methodological weaknesses in current approaches. Current approaches to the development of iMRMs either ignore or significantly under-interpret expression data. We have identified a series of methodological improvements which will substantially improve iMRM development. In particular, we will develop methods to (a) rigorously estimate gene activation states (active, inactive) from expression data and (b) utilize improved gene state estimates in the creation of iMRMs. These methods will be evaluated using in silico modeling techniques.
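As a point of reference for part (a), the sketch below assigns active and inactive states to one gene by splitting its expression values into two groups with a one-dimensional two-means clustering. This is a deliberately naive illustration of what a gene activation state is, not the rigorous estimation method proposed in this aim. The function name and the toy log2 expression values are hypothetical.

```python
def two_means_states(log_expression, n_iter=50):
    """Label each measurement active (True) or inactive (False) by
    clustering the values of one gene into a low and a high group."""
    lo, hi = min(log_expression), max(log_expression)
    if lo == hi:
        raise ValueError("expression values are constant; no split is possible")
    for _ in range(n_iter):
        cut = (lo + hi) / 2  # midpoint between the two current group means
        low = [x for x in log_expression if x <= cut]
        high = [x for x in log_expression if x > cut]
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # group means stopped changing
        lo, hi = new_lo, new_hi
    cut = (lo + hi) / 2
    return [x > cut for x in log_expression], cut

# Hypothetical log2 expression of one gene across seven conditions.
expression = [1.0, 1.2, 0.8, 6.9, 7.1, 7.0, 1.1]
states, cut = two_means_states(expression)
print(states, cut)
```

A hard threshold like this ignores measurement noise and information shared across genes, which is one reason more rigorous state estimates are needed.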
Aim 2. We will develop methods to utilize iMRMs to predict conditions for wet-lab experiments to generate and test novel biological hypotheses. We will develop methods to utilize the iMRMs developed in Aim 1 to suggest wet-lab experiments to perturb aspects of metabolic networks currently under-explored in publicly available datasets. This will allow researchers to uncover novel biology, further refine iMRMs, and link metabolic reactions to genes of unknown function. We will perform targeted wet-lab experiments in model organisms to produce RNA-seq transcriptome data and validate specific hypotheses of gene function.
Aim 3. We will develop and apply novel approaches to utilize MMs and iMRMs on thousands of organisms to better understand metabolic and regulatory diversity. Given current access to thousands of MMs (see below) and access to iMRMs for many organisms (outcome of Aim 1), we will be in a unique position to use these MMs and iMRMs in the development and application of methods to explore the impacts of environment, evolution and community on metabolic and regulatory diversity.