ResearchPad - genome-annotation https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks <![CDATA[Mitochondrial genome sequence of <i>Phytophthora sansomeana</i> and comparative analysis of <i>Phytophthora</i> mitochondrial genomes]]> https://www.researchpad.co/article/elastic_article_14567 Phytophthora sansomeana infects soybean and causes root rot. It was recently separated from the species complex P. megasperma sensu lato. In this study, we sequenced and annotated its complete mitochondrial genome and compared it to that of nine other Phytophthora species. The genome was assembled into a circular molecule of 39,618 bp with a 22.03% G+C content. Forty-two protein coding genes, 25 tRNA genes and two rRNA genes were annotated in this genome. The protein coding genes include 14 genes in the respiratory complexes, four ATP synthase genes, 16 ribosomal proteins genes, a tatC translocase gene, six conserved ORFs and a unique orf402. The tRNA genes encode tRNAs for 19 amino acids. Comparison among mitochondrial genomes of 10 Phytophthora species revealed three inversions, each covering multiple genes. These genomes were conserved in gene content with few exceptions. A 3' truncated atp9 gene was found in P. nicotianae. All 10 Phytophthora species, as well as other oomycetes and stramenopiles, lacked tRNA genes for threonine in their mitochondria. Phylogenomic analysis using the mitochondrial genomes supported or enhanced previous findings of the phylogeny of Phytophthora spp.

]]>
<![CDATA[An assessment of genome annotation coverage across the bacterial tree of life]]> https://www.researchpad.co/article/Ncb63aa35-dacb-42fa-a1c9-402a3005b91e Although gene-finding in bacterial genomes is relatively straightforward, the automated assignment of gene function is still challenging, resulting in a vast quantity of hypothetical sequences of unknown function. But how prevalent are hypothetical sequences across bacteria, what proportion of genes in different bacterial genomes remain unannotated, and what factors affect annotation completeness? To address these questions, we surveyed over 27 000 bacterial genomes from the Genome Taxonomy Database, and measured genome annotation completeness as a function of annotation method, taxonomy, genome size, 'research bias' and publication date. Our analysis revealed that 52 and 79 % of the average bacterial proteome could be functionally annotated based on protein and domain-based homology searches, respectively. Annotation coverage using protein homology search varied significantly from as low as 14 % in some species to as high as 98 % in others. We found that taxonomy is a major factor influencing annotation completeness, with distinct trends observed across the microbial tree (e.g. the lowest level of completeness was found in the Patescibacteria lineage). Most lineages showed a significant association between genome size and annotation incompleteness, likely reflecting a greater degree of uncharacterized sequences in 'accessory' proteomes than in 'core' proteomes. Finally, research bias, as measured by publication volume, was also an important factor influencing genome annotation completeness, with early model organisms showing high completeness levels relative to other genomes in their own taxonomic lineages. 
Our work highlights the disparity in annotation coverage across the bacterial tree of life and emphasizes a need for more experimental characterization of accessory proteomes as well as understudied lineages.

]]>
<![CDATA[Protein composition of the occlusion bodies of Epinotia aporema granulovirus]]> https://www.researchpad.co/article/5c6c75e6d5eed0c4843d0423

Within family Baculoviridae, members of the Betabaculovirus genus are employed as biocontrol agents against lepidopteran pests, either alone or in combination with selected members of the Alphabaculovirus genus. Epinotia aporema granulovirus (EpapGV) is a fast killing betabaculovirus that infects the bean shoot borer (E. aporema) and is a promising biopesticide. Because occlusion bodies (OBs) play a key role in baculovirus horizontal transmission, we investigated the composition of EpapGV OBs. Using mass spectrometry-based proteomics we could identify 56 proteins that are included in the OBs during the final stages of larval infection. Our data provides experimental validation of several annotated hypothetical coding sequences. Proteogenomic mapping against genomic sequence detected a previously unannotated ac110-like core gene and a putative translation fusion product of ORFs epap48 and epap49. Comparative studies of the proteomes available for the family Baculoviridae highlight the conservation of core gene products as parts of the occluded virion. Two proteins specific for betabaculoviruses (Epap48 and Epap95) are incorporated into OBs. Moreover, quantification based on emPAI values showed that Epap95 is one of the most abundant components of EpapGV OBs.

]]>
<![CDATA[Apollo: Democratizing genome annotation]]> https://www.researchpad.co/article/5c648d41d5eed0c484c823a0

Genome annotation is the process of identifying the location and function of a genome's encoded features. Improving the biological accuracy of annotation is a complex and iterative process requiring researchers to review and incorporate multiple sources of information such as transcriptome alignments, predictive models based on sequence profiles, and comparisons to features found in related organisms. Because rapidly decreasing costs are enabling an ever-growing number of scientists to incorporate sequencing as a routine laboratory technique, there is widespread demand for tools that can assist in the deliberative analytical review of genomic information. To this end, we present Apollo, an open source software package that enables researchers to efficiently inspect and refine the precise structure and role of genomic features in a graphical browser-based platform. Some of Apollo’s newer user interface features include support for real-time collaboration, allowing distributed users to simultaneously edit the same encoded features while also instantly seeing the updates made by other researchers on the same region in a manner similar to Google Docs. Its technical architecture enables Apollo to be integrated into multiple existing genomic analysis pipelines and heterogeneous laboratory workflow platforms. Finally, we consider the implications that Apollo and related applications may have on how the results of genome research are published and made accessible.

]]>
<![CDATA[Analysis of genetic diversity and structure in a worldwide walnut (Juglans regia L.) germplasm using SSR markers]]> https://www.researchpad.co/article/5c06f052d5eed0c484c6d6e5

Persian or English walnut (Juglans regia L.), the walnut species cultivated for nut production, is one of the oldest food sources known and is grown worldwide in temperate areas. France is the 7th leading producer as of 2016 with 39 kt. Deciphering walnut genetic diversity and structure is important for efficient management and use of genetic resources. In this work, 253 worldwide accessions from the INRA walnut germplasm collection, containing English walnut and several related species, were genotyped using 13 SSR (Simple Sequence Repeat) markers selected from the literature to assess diversity and structure. Genetic diversity parameters showed a deficiency of heterozygotes and, for several SSRs, allele-specificities among the accessions tested. Principal Coordinate Analysis (PCoA) showed that the 253 accessions clustered largely in agreement with the existing botanical classification of the genus. Among the 217 J. regia accessions, two main clusters, accessions from Eastern Europe and Asia, and accessions from Western Europe and America, were identified using STRUCTURE software. This was confirmed by Principal Coordinate Analysis and supported by Neighbor-Joining tree construction using DARwin software. Moreover, a substructure was found within the two clusters, mainly according to geographical origin. A core collection containing 50 accessions was selected using the maximum length sub-tree method and prior knowledge about their phenotype. The present study constitutes a preliminary population genetics overview of the INRA walnut genetic resources collection using SSR markers. The resulting estimations of genetic diversity and structure are useful for germplasm management and for future walnut breeding programs.

]]>
<![CDATA[Parental high dietary arachidonic acid levels modulated the hepatic transcriptome of adult zebrafish (Danio rerio) progeny]]> https://www.researchpad.co/article/5b6da1c4463d7e4dccc5faf6

Disproportionately high intake of n-6 polyunsaturated fatty acids (PUFAs) in the diet is considered a major human health concern. The present study examines changes in the hepatic gene expression pattern of adult male zebrafish progeny associated with high levels of the n-6 PUFA arachidonic acid (ARA) in the parental diet. The parental generation (F0) was fed a diet which was either low (control) or high in ARA (high ARA). Progenies of both groups (F1) were given the control diet. No differences in body weight were found between the diet groups within adult stages of either F0 or F1 generation. Few differentially expressed genes were observed between the two dietary groups in the F0 in contrast to the F1 generation. Several links were found between the previous metabolic analysis of the parental fish and the gene expression analysis in their adult progeny. The main gene expression differences in the progeny were related to lipid and retinoid metabolism, with PPARα/RXRα playing a central role in mediating changes to lipid and long-chain fatty acid metabolism. The enrichment of genes involved in β-oxidation observed in the progeny corresponded to the increase in peroxisomal β-oxidative degradation of long-chain fatty acids in the parental fish metabolomics data. Similar links between the F0 and F1 generation were identified for the methionine cycle and transsulfuration pathway in the high ARA group. In addition, estrogen signalling was found to be affected by parental high dietary ARA levels, with gene expression changing in opposite directions in F1 and F0. This study shows that the dietary n-3/n-6 PUFA ratio can alter gene expression patterns in the adult progeny. Whether the effect is mediated by permanent epigenetic mechanisms regulating gene expression in developing gametes needs to be further investigated.

]]>
<![CDATA[Allele Workbench: Transcriptome Pipeline and Interactive Graphics for Allele-Specific Expression]]> https://www.researchpad.co/article/5989dad7ab0ee8fa60bb87e5

Sequencing the transcriptome can answer various questions such as determining the transcripts expressed in a given species for a specific tissue or condition, evaluating differential expression, discovering variants, and evaluating allele-specific expression. Differential expression evaluates the expression differences between different strains, tissues, and conditions. Allele-specific expression evaluates expression differences between parental alleles. Both differential expression and allele-specific expression have been studied for heterosis (hybrid vigor), where the hybrid has improved performance over the parents for one or more traits. The Allele Workbench software was developed for a heterosis study that evaluated allele-specific expression for a mouse F1 hybrid using libraries from multiple tissues with biological replicates. This software has been made into a distributable package, which includes a pipeline, a Java interface to build the database, and a Java interface for query and display of the results. The required input is a reference genome, annotation file, and one or more RNA-Seq libraries with optional replicates. It evaluates allelic imbalance at the SNP and transcript level and flags transcripts with significant opposite directional allele-specific expression. The Java interface allows the user to view data from libraries, replicates, genes, transcripts, exons, and variants, including queries on allele imbalance for selected libraries. To determine the impact of allele-specific SNPs on protein folding, variants are annotated with their effect (e.g., missense), and the parental protein sequences may be exported for protein folding analysis. The Allele Workbench processing results in transcript files and read counts that can be used as input to the previously published Transcriptome Computational Workbench, which has a new algorithm for determining a trimmed set of gene ontology terms. 
The software with demo files is available from https://code.google.com/p/allele-workbench. Additionally, all software is ready for immediate use from an Atmosphere Virtual Machine Image available from the iPlant Collaborative (www.iplantcollaborative.org).

]]>
<![CDATA[Genome Analysis of Bacillus amyloliquefaciens Subsp. plantarum UCMB5113: A Rhizobacterium That Improves Plant Growth and Stress Management]]> https://www.researchpad.co/article/5989db00ab0ee8fa60bc678f

The Bacillus amyloliquefaciens subsp. plantarum strain UCMB5113 is a Gram-positive rhizobacterium that can colonize plant roots and stimulate plant growth and defense based on unknown mechanisms. This reinforcement of plants may provide protection against various forms of biotic and abiotic stress. To determine the genetic traits involved in the mechanism of plant-bacteria association, the genome sequence of UCMB5113 was obtained by assembling paired-end Illumina reads. The assembled chromosome of 3,889,532 bp was predicted to encode 3,656 proteins. Genes that potentially contribute to plant growth promotion such as indole-3-acetic acid (IAA) biosynthesis, acetoin synthesis and siderophore production were identified. Moreover, annotation identified putative genes responsible for non-ribosomal synthesis of secondary metabolites and genes supporting environment fitness of UCMB5113 including drug and metal resistance. A large number of genes encoding a diverse set of secretory proteins, enzymes of primary and secondary metabolism and carbohydrate active enzymes were found, which reflects a high capacity to degrade various rhizosphere macromolecules. Additionally, many predicted membrane transporters provide the bacterium with efficient uptake capabilities for several nutrients. Although UCMB5113 has the potential to produce antibiotics and biosurfactants, its protective effect on plants against pathogens seems to be indirect and due to priming of plant induced systemic resistance. The availability of the genome enables identification of genes and their function underpinning beneficial interactions of UCMB5113 with plants.

]]>
<![CDATA[Signature Gene Expression Reveals Novel Clues to the Molecular Mechanisms of Dimorphic Transition in Penicillium marneffei]]> https://www.researchpad.co/article/5989da11ab0ee8fa60b79826

Systemic dimorphic fungi cause more than one million new infections each year, ranking them among the significant public health challenges currently encountered. Penicillium marneffei is a systemic dimorphic fungus endemic to Southeast Asia. The temperature-dependent dimorphic phase transition between mycelium and yeast is considered crucial for the pathogenicity and transmission of P. marneffei, but the underlying mechanisms are still poorly understood. Here, we re-sequenced P. marneffei strain PM1 using multiple sequencing platforms and assembled the genome using hybrid genome assembly. We determined gene expression levels using RNA sequencing at the mycelial and yeast phases of P. marneffei, as well as during phase transition. We classified 2,718 genes with variable expression across conditions into 14 distinct groups, each marked by a signature expression pattern implicated at a certain stage in the dimorphic life cycle. Genes with the same expression patterns tend to be clustered together on the genome, suggesting orchestrated regulations of the transcriptional activities of neighboring genes. Using qRT-PCR, we validated expression levels of all genes in one of the clusters highly expressed during the yeast-to-mycelium transition. These included madsA, a gene encoding a MADS-box transcription factor whose gene family is exclusively expanded in P. marneffei. Over-expression of madsA drove P. marneffei to undergo mycelial growth at 37°C, a condition that restricts the wild-type to the yeast phase. Furthermore, analyses of signature expression patterns suggested diverse roles of secreted proteins at different developmental stages and the potential importance of non-coding RNAs in mycelium-to-yeast transition. We also showed that RNA structural transition in response to temperature changes may be related to the control of thermal dimorphism. Together, our findings have revealed multiple molecular mechanisms that may underlie the dimorphic transition in P. marneffei, providing a powerful foundation for identifying molecular targets for mechanism-based interventions.

]]>
<![CDATA[Transcription Start Site Associated RNAs (TSSaRNAs) Are Ubiquitous in All Domains of Life]]> https://www.researchpad.co/article/5989da25ab0ee8fa60b805bb

A plethora of non-coding RNAs has been discovered using high-resolution transcriptomics tools, indicating that transcriptional and post-transcriptional regulation is much more complex than previously appreciated. Small RNAs associated with transcription start sites of annotated coding regions (TSSaRNAs) are pervasive in both eukaryotes and bacteria. Here, we provide evidence for existence of TSSaRNAs in several archaeal transcriptomes including: Halobacterium salinarum, Pyrococcus furiosus, Methanococcus maripaludis, and Sulfolobus solfataricus. We validated TSSaRNAs from the model archaeon Halobacterium salinarum NRC-1 by deep sequencing two independent small-RNA enriched (RNA-seq) and a primary-transcript enriched (dRNA-seq) strand-specific libraries. We identified 652 transcripts, of which 179 were shown to be primary transcripts (∼7% of the annotated genome). Distinct growth-associated expression patterns between TSSaRNAs and their cognate genes were observed, indicating a possible role in environmental responses that may result from RNA polymerase with varying pausing rhythms. This work shows that TSSaRNAs are ubiquitous across all domains of life.

]]>
<![CDATA[NSIT: Novel Sequence Identification Tool]]> https://www.researchpad.co/article/5989db02ab0ee8fa60bc6ebb

Novel sequences are DNA sequences present in an individual's genome but absent in the human reference assembly. They are predicted to be biologically important, both individual and population specific, and consistent with the known human migration paths. Recent works have shown that an average person harbors 2–5 Mb of such sequences and estimated that the human pan-genome contains as much as 19–40 Mb of novel sequences. To identify them in a de novo genome assembly, some existing sequence aligners have been used but no computational method has been specifically proposed for this task. In this work, we developed NSIT (Novel Sequence Identification Tool), a software tool that can accurately and efficiently identify novel sequences in an individual's de novo whole genome assembly. We identified and characterized 1.1 Mb, 1.2 Mb, and 1.0 Mb of novel sequences in NA18507 (African), YH (Asian), and NA12878 (European) de novo genome assemblies, respectively. Our results show very high concordance with the previous work using the respective reference assembly. In addition, our results using the latest human reference assembly suggest that the amount of novel sequences per individual may not be as high as previously reported. We additionally developed a graphical viewer for comparisons of novel sequence contents. The viewer also helped in identifying sequence contamination; we found 130 kb of Epstein-Barr virus sequence in the previously published NA18507 novel sequences as well as 287 kb of zebrafish repeats in the NA12878 de novo assembly. NSIT requires 2 GB of RAM and 1.5–2 hrs on a commodity desktop. The program is applicable to input assemblies with varying contig/scaffold sizes, ranging from 100 bp to as high as 50 Mb. It works in both 32-bit and 64-bit systems and outperforms, by large margins, other fast sequence aligners previously applied to this task. 
To our knowledge, NSIT is the first software designed specifically for novel sequence identification in a de novo human genome assembly.

]]>
<![CDATA[Genomes of Gardnerella Strains Reveal an Abundance of Prophages within the Bladder Microbiome]]> https://www.researchpad.co/article/5989db01ab0ee8fa60bc6bb4

Bacterial surveys of the vaginal and bladder human microbiota have revealed an abundance of many similar bacterial taxa. As the bladder was once thought to be sterile, the complex interactions between microbes within the bladder have yet to be characterized. To initiate this process, we have begun sequencing isolates, including the clinically relevant genus Gardnerella. Herein, we present the genomic sequences of four Gardnerella strains isolated from the bladders of women with symptoms of urgency urinary incontinence; these are the first Gardnerella genomes produced from this niche. Consistent with genomic characterization of Gardnerella isolates from the reproductive tract, isolates from the bladder reveal a large pangenome, as well as evidence of high frequency horizontal gene transfer. Prophage gene sequences were found to be abundant amongst the strains isolated from the bladder, as well as amongst publicly available Gardnerella genomes from the vagina and endometrium, motivating an in-depth examination of these sequences. Amongst the 39 Gardnerella strains examined here, there were more than 400 annotated prophage gene sequences that we could cluster into 95 homologous groups; 49 of these groups were unique to a single strain. While many of these prophages exhibited no sequence similarity to any lytic phage genome, estimation of the rate of phage acquisition suggests both vertical and horizontal acquisition. Furthermore, bioinformatic evidence indicates that prophage acquisition is ongoing within both vaginal and bladder Gardnerella populations. The abundance of prophage sequences within the strains examined here suggests that phages could play an important role in the species’ evolutionary history and in its interactions within the complex communities found in the female urinary and reproductive tracts.

]]>
<![CDATA[Genome Annotation Provides Insight into Carbon Monoxide and Hydrogen Metabolism in Rubrivivax gelatinosus]]> https://www.researchpad.co/article/5989d9d8ab0ee8fa60b66aba

We report here the sequencing and analysis of the genome of the purple non-sulfur photosynthetic bacterium Rubrivivax gelatinosus CBS. This microbe is a model for studies of its carboxydotrophic life style under anaerobic condition, based on its ability to utilize carbon monoxide (CO) as the sole carbon substrate and water as the electron acceptor, yielding CO2 and H2 as the end products. The CO-oxidation reaction is known to be catalyzed by two enzyme complexes, the CO dehydrogenase and hydrogenase. As expected, analysis of the genome of Rx. gelatinosus CBS reveals the presence of genes encoding both enzyme complexes. The CO-oxidation reaction is CO-inducible, which is consistent with the presence of two putative CO-sensing transcription factors in its genome. Genome analysis also reveals the presence of two additional hydrogenases, an uptake hydrogenase that liberates the electrons in H2 in support of cell growth, and a regulatory hydrogenase that senses H2 and relays the signal to a two-component system that ultimately controls synthesis of the uptake hydrogenase. The genome also contains two sets of hydrogenase maturation genes which are known to assemble the catalytic metallocluster of the hydrogenase NiFe active site. Collectively, the genome sequence and analysis information reveals the blueprint of an intricate network of signal transduction pathways and its underlying regulation that enables Rx. gelatinosus CBS to thrive on CO or H2 in support of cell growth.

]]>
<![CDATA[Integrative Tissue-Specific Functional Annotations in the Human Genome Provide Novel Insights on Many Complex Traits and Improve Signal Prioritization in Genome Wide Association Studies]]> https://www.researchpad.co/article/5989db27ab0ee8fa60bd0875

Extensive efforts have been made to understand genomic function through both experimental and computational approaches, yet proper annotation still remains challenging, especially in non-coding regions. In this manuscript, we introduce GenoSkyline, an unsupervised learning framework to predict tissue-specific functional regions through integrating high-throughput epigenetic annotations. GenoSkyline successfully identified a variety of non-coding regulatory machinery including enhancers, regulatory miRNA, and hypomethylated transposable elements in extensive case studies. Integrative analysis of GenoSkyline annotations and results from genome-wide association studies (GWAS) led to novel biological insights on the etiologies of a number of human complex traits. We also explored using tissue-specific functional annotations to prioritize GWAS signals and predict relevant tissue types for each risk locus. Brain and blood-specific annotations led to better prioritization performance for schizophrenia than standard GWAS p-values and non-tissue-specific annotations. As for coronary artery disease, heart-specific functional regions were highly enriched for GWAS signals, but previously identified risk loci were found to be most functional in other tissues, suggesting a substantial proportion of still undetected heart-related loci. In summary, GenoSkyline annotations can guide genetic studies at multiple resolutions and provide valuable insights in understanding complex diseases. GenoSkyline is available at http://genocanyon.med.yale.edu/GenoSkyline.

]]>
<![CDATA[Accelerating the Switchgrass (Panicum virgatum L.) Breeding Cycle Using Genomic Selection Approaches]]> https://www.researchpad.co/article/5989dabcab0ee8fa60baf31e

Switchgrass (Panicum virgatum L.) is a perennial grass undergoing development as a biofuel feedstock. One of the most important factors hindering breeding efforts in this species is the need for accurate measurement of biomass yield on a per-hectare basis. Genomic selection on simple-to-measure traits that approximate biomass yield has the potential to significantly speed up the breeding cycle. Recent advances in switchgrass genomic and phenotypic resources are now making it possible to evaluate the potential of genomic selection of such traits. We leveraged these resources to study the ability of three widely-used genomic selection models to predict phenotypic values of morphological and biomass quality traits in an association panel consisting of predominantly northern adapted upland germplasm. High prediction accuracies were obtained for most of the traits, with standability having the highest ten-fold cross validation prediction accuracy (0.52). Moreover, the morphological traits generally had higher prediction accuracies than the biomass quality traits. Nevertheless, our results suggest that the quality of current genomic and phenotypic resources available for switchgrass is sufficiently high for genomic selection to significantly impact breeding efforts for biomass yield.

]]>
<![CDATA[Robust Identification of Noncoding RNA from Transcriptomes Requires Phylogenetically-Informed Sampling]]> https://www.researchpad.co/article/5989db1bab0ee8fa60bce27d

Noncoding RNAs are integral to a wide range of biological processes, including translation, gene regulation, host-pathogen interactions and environmental sensing. While genomics is now a mature field, our capacity to identify noncoding RNA elements in bacterial and archaeal genomes is hampered by the difficulty of de novo identification. The emergence of new technologies for characterizing transcriptome outputs, notably RNA-seq, is improving noncoding RNA identification and expression quantification. However, a major challenge is to robustly distinguish functional outputs from transcriptional noise. To establish whether annotation of existing transcriptome data has effectively captured all functional outputs, we analysed over 400 publicly available RNA-seq datasets spanning 37 different Archaea and Bacteria. Using comparative tools, we identify close to a thousand highly-expressed candidate noncoding RNAs. However, our analyses reveal that the capacity to identify noncoding RNA outputs is strongly dependent on phylogenetic sampling. Surprisingly, and in stark contrast to protein-coding genes, the phylogenetic window for effective use of comparative methods is perversely narrow: aggregating public datasets only produced one phylogenetic cluster where these tools could be used to robustly separate unannotated noncoding RNAs from a null hypothesis of transcriptional noise. Our results show that for the full potential of transcriptomics data to be realized, a change in experimental design is paramount: effective transcriptomics requires phylogeny-aware sampling.

]]>
<![CDATA[Leveraging Genomic Annotations and Pleiotropic Enrichment for Improved Replication Rates in Schizophrenia GWAS]]> https://www.researchpad.co/article/5989da2dab0ee8fa60b83131

Most of the genetic architecture of schizophrenia (SCZ) has not yet been identified. Here, we apply a novel statistical algorithm called Covariate-Modulated Mixture Modeling (CM3), which incorporates auxiliary information (heterozygosity, total linkage disequilibrium, genomic annotations, pleiotropy) for each single nucleotide polymorphism (SNP) to enable more accurate estimation of replication probabilities, conditional on the observed test statistic (“z-score”) of the SNP. We use a multiple logistic regression on z-scores to combine the auxiliary information and derive a “relative enrichment score” for each SNP. For each stratum of these relative enrichment scores, we obtain nonparametric estimates of posterior expected test statistics and replication probabilities as a function of discovery z-scores, using a resampling-based approach that repeatedly and randomly partitions meta-analysis sub-studies into training and replication samples. We fit a scale mixture of two Gaussians model to each stratum, obtaining parameter estimates that minimize the sum of squared differences of the scale-mixture model with the stratified nonparametric estimates. We apply this approach to the recent genome-wide association study (GWAS) of SCZ (n = 82,315), obtaining a good fit between the model-based and observed effect sizes and replication probabilities. We observed that SNPs with low enrichment scores replicate with a lower probability than SNPs with high enrichment scores even when both are genome-wide significant (p < 5 × 10⁻⁸). There were 693 and 219 independent loci with model-based replication rates ≥80% and ≥90%, respectively. Compared to analyses not incorporating relative enrichment scores, CM3 increased out-of-sample yield for SNPs that replicate at a given rate. This demonstrates that replication probabilities can be more accurately estimated using prior enrichment information with CM3.

]]>
<![CDATA[MacSyFinder: A Program to Mine Genomes for Molecular Systems with an Application to CRISPR-Cas Systems]]> https://www.researchpad.co/article/5989dab4ab0ee8fa60bac514

Motivation

Biologists often wish to use their knowledge on a few experimental models of a given molecular system to identify homologs in genomic data. We developed a generic tool for this purpose.

Results

Macromolecular System Finder (MacSyFinder) provides a flexible framework to model the properties of molecular systems (cellular machinery or pathway) including their components, evolutionary associations with other systems and genetic architecture. Modelled features also include functional analogs, and the multiple uses of the same component by different systems. Models are used to search for molecular systems in complete genomes or in unstructured data like metagenomes. The components of the systems are searched by sequence similarity using Hidden Markov model (HMM) protein profiles. The assignment of hits to a given system is decided based on compliance with the content and organization of the system model. A graphical interface, MacSyView, facilitates the analysis of the results by showing overviews of component content and genomic context. To exemplify the use of MacSyFinder we built models to detect and classify CRISPR-Cas systems following a previously established classification. We show that MacSyFinder makes it easy to define an accurate “Cas-finder” using publicly available protein profiles.

Availability and Implementation

MacSyFinder is a standalone application implemented in Python. It requires Python 2.7, HMMER and makeblastdb (version 2.2.28 or higher). It is freely available with its source code under a GPLv3 license at https://github.com/gem-pasteur/macsyfinder. It is compatible with all platforms supporting Python and HMMER/makeblastdb. The “Cas-finder” (models and HMM profiles) is distributed as a compressed tarball archive as Supporting Information.

]]>
<![CDATA[Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads]]> https://www.researchpad.co/article/5989da99ab0ee8fa60ba3281

To assess the functional capacities of microbial communities, including those inhabiting the human body, shotgun metagenomic reads are often aligned to a database of known genes. Such homology-based annotation practices critically rely on the assumption that short reads can map to orthologous genes of similar function. This assumption, however, and the various factors that impact short read annotation, have not been systematically evaluated. To address this challenge, we generated an extremely large database of simulated reads (totaling 15.9 Gb), spanning over 500,000 microbial genes and 170 curated genomes and including, for many genomes, every possible read of a given length. We annotated each read using common metagenomic protocols, fully characterizing the effect of read length, sequencing error, phylogeny, database coverage, and mapping parameters. We additionally rigorously quantified gene-, genome-, and protocol-specific annotation biases. Overall, our findings provide a first comprehensive evaluation of the capabilities and limitations of functional metagenomic annotation and offer goal-specific best-practice guidelines to inform future metagenomic research.
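
As an illustration of what "every possible read of a given length" means, the sketch below enumerates all substrings of a fixed length from one sequence with a sliding window. It is an assumption-laden simplification: the function name all_reads is hypothetical, only the forward strand is covered, and no sequencing errors are simulated, whereas the study also varied sequencing error.

```python
def all_reads(sequence, read_length):
    """Return every contiguous read of `read_length` from `sequence`.

    Forward strand only and error-free (illustrative assumptions).
    A sequence shorter than the read length yields an empty list.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    last_start = len(sequence) - read_length
    return [sequence[i:i + read_length] for i in range(last_start + 1)]
```

A sequence of length n yields n - read_length + 1 reads when n is at least read_length, which gives a sense of why exhaustive enumeration over many genomes produces a very large read set.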

]]>
<![CDATA[The Influence of Promoter Architectures and Regulatory Motifs on Gene Expression in Escherichia coli]]> https://www.researchpad.co/article/5989d9f0ab0ee8fa60b6e53c

The ability to regulate gene expression is of central importance for the adaptability of living organisms to changes in their external and internal environment. At the transcriptional level, binding of transcription factors (TFs) in the promoter region can modulate the transcription rate, hence making TFs central players in gene regulation. For some model organisms, information about the locations and identities of discovered TF binding sites has been collected in continually updated databases, such as RegulonDB for the well-studied case of E. coli. In order to reveal the general principles behind the binding-site arrangement and function of these regulatory architectures, we propose a random promoter architecture model that preserves the overall abundance of binding sites to identify overrepresented binding site configurations. This model is analogous to the random network model used in the study of genetic network motifs, where regulatory motifs are identified through their overrepresentation with respect to a “randomly connected” genetic network. Using our model, we identify TF pairs that coregulate operons in an overrepresented fashion, or individual TFs that act at multiple binding sites per promoter by, for example, cooperative binding, DNA looping, or through multiple binding domains. We furthermore explore the relationship between promoter architecture and gene expression, using three different genome-wide protein copy number censuses. Perhaps surprisingly, we find no systematic correlation between the number of activator and repressor binding sites regulating a gene and the level of gene expression. A position-weight-matrix model used to estimate the binding affinity of RNA polymerase (RNAP) to the promoters of activated and repressed genes suggests that this lack of correlation might in part be due to differences in basal transcription levels, with repressed genes having a higher basal activity level. This quantitative catalogue relating promoter architecture and function provides a first step towards genome-wide predictive models of regulatory function.
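
The position-weight-matrix scoring mentioned above can be sketched as a sum of per-position log-odds scores. This is a generic illustration, not the model used in the study: the function name pwm_log_odds is hypothetical, the uniform background of 0.25 per base is an assumption, and any frequencies passed in are supplied by the caller rather than taken from the paper.

```python
import math


def pwm_log_odds(site, freqs, background=0.25):
    """Score a candidate binding site against a position weight matrix.

    site:  DNA string, e.g. "TATAAT".
    freqs: one dict per position mapping each base to its (nonzero)
           frequency among known sites.
    Returns the summed log-odds of the site versus a uniform background;
    higher scores indicate a closer match to the matrix.
    """
    if len(site) != len(freqs):
        raise ValueError("site and matrix must have the same length")
    return sum(
        math.log(freqs[i][base] / background)
        for i, base in enumerate(site)
    )
```

Under a simple thermodynamic interpretation, which is itself an assumption, a higher log-odds score corresponds to stronger predicted binding, so comparing scores across promoters gives a rough proxy for relative basal RNAP affinity.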

]]>