ResearchPad - original-papers https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks <![CDATA[Associations between paracetamol (acetaminophen) intake between 18 and 32 weeks gestation and neurocognitive outcomes in the child: A longitudinal cohort study]]> https://www.researchpad.co/article/elastic_article_7015 The majority of epidemiological studies concerning possible adverse effects of paracetamol (acetaminophen) in pregnancy have focussed on childhood asthma. Initial results of a robust association have been confirmed in several studies. Recently, a few cohort studies have looked at particular neurocognitive outcomes, and several have implicated hyperactivity. Objectives: To confirm these findings, further information and results are required. Here, we assess whether paracetamol intake between 18 and 32 weeks gestation is associated with childhood behavioural and cognitive outcomes using a large population. Methods: Data collected by the Avon Longitudinal Study of Parents and Children (ALSPAC) at 32 weeks gestation, referring to the period from 18 to 32 weeks, identified 43.9% of women as having taken paracetamol. We used an exposome analysis first to determine the background factors associated with pregnant women taking the drug, and then adjusted for those factors when assessing associations with child outcomes (measured using regression analyses). Results: We identified 15 variables independently associated with taking paracetamol in this time period, which were used as potential confounders. Of the 135 neurocognitive variables considered, adjusting for the likelihood of false discovery, we identified 56 outcomes for adjusted analyses. Adjustment identified 12 showing independent associations with paracetamol use at P < .05, four of which were at P < .0001 (all related to child behaviours reported by the mother at 42 and 47 months; e.g. conduct problems: adjusted mean score +0.22 (95% confidence interval 0.10, 0.33)). 
There were few associations with behavioural or neurocognitive outcomes after age 7‐8 years, whether reported by the mother or the teacher. Conclusions: If paracetamol use in mid‐to‐late pregnancy has an adverse effect on child neurocognitive outcomes, it appears mainly to relate to the pre‐school period. It is important that these results be tested using other datasets or methodologies before assuming that they are causal. ]]> <![CDATA[Loss of <i>BAP1</i> expression is associated with an immunosuppressive microenvironment in uveal melanoma, with implications for immunotherapy development]]> https://www.researchpad.co/article/elastic_article_6865 Immunotherapy using immune checkpoint inhibitors (ICIs) induces durable responses in many metastatic cancers. Metastatic uveal melanoma (mUM), typically occurring in the liver, is one of the tumours most refractory to ICIs and has dismal outcomes. Monosomy 3 (M3), polysomy 8q, and BAP1 loss in primary uveal melanoma (pUM) are associated with poor prognoses. The presence of tumour‐infiltrating lymphocytes (TILs) within pUM and surrounding mUM – and some evidence of clinical responses to adoptive TIL transfer – strongly suggests that UMs are indeed immunogenic despite their low mutational burden. The mechanisms that suppress TILs in pUM and mUM are unknown. We show that BAP1 loss is correlated with upregulation of several genes associated with suppressive immune responses, some of which form an immunosuppressive axis, including HLA‐DR, CD38, and CD74. Furthermore, single‐cell analysis of pUM by mass cytometry confirmed the expression of these and other markers, revealing important functions of infiltrating immune cells in UM, most being regulatory CD8+ T lymphocytes and tumour‐associated macrophages (TAMs). Transcriptomic analysis of hepatic mUM revealed immune profiles similar to those of pUM with BAP1 loss, including the expression of IDO1. 
At the protein level, we observed TAMs and TILs entrapped within peritumoural fibrotic areas surrounding mUM, with increased expression of IDO1, PD‐L1, and β‐catenin (CTNNB1), suggesting tumour‐driven immune exclusion and hence immunotherapy resistance. These findings aid the understanding of how the immune response is organised in BAP1-deficient mUM, which will further enable functional validation of the detected biomarkers and the development of focused immunotherapeutic approaches. © 2020 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.

]]>
<![CDATA[Fracture prevention: a population-based intervention delivered in primary care]]> https://www.researchpad.co/article/N0114ba6b-ec07-4a71-8db4-7f0af5cc34ec Osteoporosis is common, is increasing as the population ages, and has significant consequences, including fracture. Effective treatments are available. Aim: To support proactive fracture risk assessment (FRAX) and optimize treatment for high-risk patients in primary care. Design: Clinical cohort. Setting: From November 2017 to November 2018, support was provided to 71 practices comprising 69 of 90 practices within two National Health Service Clinical Commissioning Group areas. Total population 579 508 (207 263 aged over 50 years). Participants: FRAX assessment (National Institute for Health and Care Excellence guideline NICE CG146) in (i) males aged 75 years and over, (ii) females aged 65 years and over, (iii) females aged under 65 years and males aged under 75 years with risk factors and (iv) those under 50 years with major risk factors. Results: A total of 158 946 met NICE CG146; 11 961 were coded with an osteoporosis diagnosis (7.5%), and of those, 42% were prescribed treatment with a bone sparing agent (BSA). In total, 6942 were assessed to initiate BSA. Thirty percent of patients with an untreated osteoporosis diagnosis had never been prescribed BSA. Even when BSA had been prescribed, for 1700 people (35%) it was for less than the minimum recommended duration. Of the total 9784 patients within the FRAX recommended-to-treat threshold, 3197 (33%) were currently treated with BSA and 3684 (37%) had no history of ever receiving BSA. Among untreated patients, the expected incidence is 875 fractures over a 3-year period (costing approximately £3.4 million). Treatment would prevent 274 fractures (cost reduction £1 274 045; allowing for prescribing costs, a saving of £805 145 after 3 years of treatment). Conclusion: Underdiagnosis and suboptimal treatment of osteoporosis were identified. 
Results suggest that implementing NICE guidance and optimizing treatment in practice is possible and could prevent a significant number of fractures. ]]> <![CDATA[PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells]]> https://www.researchpad.co/article/N077019bc-e2ab-4a6e-be79-014b80a68a50 New single-cell technologies continue to fuel the explosive growth in the scale of heterogeneous single-cell data. However, existing computational methods scale inadequately to large datasets and therefore cannot uncover the complex cellular heterogeneity. Results: We introduce a highly scalable graph-based clustering algorithm, PARC—Phenotyping by Accelerated Refined Community-partitioning—for large-scale, high-dimensional single-cell data (>1 million cells). Using large single-cell flow and mass cytometry, RNA-seq and imaging-based biophysical data, we demonstrate that PARC consistently outperforms state-of-the-art clustering algorithms without subsampling of cells, including Phenograph, FlowSOM and Flock, in terms of both speed and the ability to robustly detect rare cell populations. For example, PARC can cluster a single-cell dataset of 1.1 million cells within 13 min, compared with >2 h for the next fastest graph-clustering algorithm. Our work presents a scalable algorithm to cope with increasingly large-scale single-cell analyses. Availability and implementation: https://github.com/ShobiStassen/PARC. Supplementary information: Supplementary data are available at Bioinformatics online. ]]> <![CDATA[A Bayesian approach to accurate and robust signature detection on LINCS L1000 data]]> https://www.researchpad.co/article/Ndf113415-a091-4231-aa7f-f426e92c86c6 The LINCS L1000 dataset contains extensive cellular expression data induced by large sets of perturbagens. 
Although it provides invaluable resources for drug discovery and for understanding disease mechanisms, the existing peak deconvolution algorithms cannot recover the accurate expression levels of genes in many cases, inducing severe noise in the dataset and limiting its applications in biomedical studies. Results: Here, we present a novel Bayesian peak deconvolution algorithm that gives unbiased likelihood estimates of peak locations and characterizes the peaks with probability-based z-scores. Based on this algorithm, we build a pipeline to process raw data from the L1000 assay into signatures that represent the features of each perturbagen. The performance of the proposed pipeline is evaluated using the similarity between the signatures of bio-replicates and of drugs with shared targets, and the results show that signatures derived from our pipeline give a substantially more reliable and informative representation of perturbagens than existing methods. Thus, the new pipeline may significantly boost the performance of L1000 data in downstream applications such as drug repurposing, disease modeling and gene function prediction. Availability and implementation: The code and the precomputed data for LINCS L1000 Phase II (GSE 70138) are available at https://github.com/njpipeorgan/L1000-bayesian. Supplementary information: Supplementary data are available at Bioinformatics online. ]]> <![CDATA[MDEHT: a multivariate approach for detecting differential expression of microRNA isoform data in RNA-sequencing studies]]> https://www.researchpad.co/article/Nb4b85f58-8f1f-402d-b975-b708b04d85ce miRNA isoforms (isomiRs) are produced from the same arm as the archetype miRNA, differing by a few nucleotides at the 5′ and/or 3′ termini. These well-conserved isomiRs are functionally important and have contributed to the evolution of miRNA genes. 
Accurate detection of differential expression of miRNAs can bring new insights into the cellular functions of miRNA and further improve miRNA-based diagnostic and prognostic applications. However, very few methods take isomiR variation into account in the analysis of miRNA differential expression. Results: To overcome this challenge, we developed a novel approach that takes advantage of the multidimensional structure of isomiR data from the same miRNAs, termed multivariate differential expression by Hotelling’s T2 test (MDEHT). The utilization of the information hidden in isomiRs enables MDEHT to increase the power of identifying differentially expressed miRNAs that are not marginally detectable by univariate testing methods. We conducted rigorous and unbiased comparisons of MDEHT with seven commonly used tools on simulated and real datasets from The Cancer Genome Atlas. Our comprehensive evaluations demonstrated that the MDEHT method was robust across various datasets and outperformed other commonly used tools in terms of Type I error rate, true positive rate and reproducibility. Availability and implementation: The source code for identifying and quantifying isomiRs and performing miRNA differential expression analysis is available at https://github.com/amanzju/MDEHT. Supplementary information: Supplementary data are available at Bioinformatics online. ]]> <![CDATA[ <i>BioSeqZip</i>: a collapser of NGS redundant reads for the optimization of sequence analysis]]> https://www.researchpad.co/article/N57483afe-1e29-4ccb-a124-b3461a285839 High-throughput next-generation sequencing can generate huge sequence files, whose analysis requires alignment algorithms that are typically very demanding in terms of memory and computational resources. This is a significant issue, especially for machines with limited hardware capabilities. 
As the redundancy of the sequences typically increases with coverage, collapsing such files into compact sets of non-redundant reads has the two-fold advantage of reducing file size and speeding up alignment by avoiding mapping the same sequence multiple times. Method: BioSeqZip generates compact and sorted lists of alignment-ready non-redundant sequences, keeping track of their occurrences in the raw files as well as of their quality score information. By exploiting a memory-constrained external sorting algorithm, it can be executed on either single- or multi-sample datasets, even on computers with modest computational capabilities. On request, it can also re-expand the compacted files to their original state. Results: Our extensive experiments on RNA-Seq data show that BioSeqZip considerably reduces the computational costs of a standard sequence analysis pipeline, with particular benefits for the alignment procedures that typically have the highest requirements in terms of memory and execution time. In our tests, BioSeqZip was able to compact 2.7 billion reads into 963 million unique tags, reducing the size of sequence files by up to 70% and speeding up alignment by at least 50%. Availability and implementation: BioSeqZip is available at https://github.com/bioinformatics-polito/BioSeqZip. Supplementary information: Supplementary data are available at Bioinformatics online. ]]> <![CDATA[atomium—a Python structure parser]]> https://www.researchpad.co/article/N48cdda5b-592b-40b2-a389-9dd18c3d3ef7 Structural biology relies on specific file formats to convey information about macromolecular structures. Traditionally this has been the PDB format, but increasingly newer formats, such as PDBML, mmCIF and MMTF, are being used. Here we present atomium, a modern, lightweight Python library for parsing, manipulating and saving the PDB, mmCIF and MMTF file formats. 
In addition, we provide a web service, pdb2json, which uses atomium to give a consistent JSON representation of the entire Protein Data Bank. Availability and implementation: atomium is implemented in Python and its performance is equivalent to that of the existing library BioPython. However, it has significant advantages in features and API design. atomium is available from atomium.bioinf.org.uk and pdb2json can be accessed at pdb2json.bioinf.org.uk. Supplementary information: Supplementary data are available at Bioinformatics online. ]]> <![CDATA[Bivartect: accurate and memory-saving breakpoint detection by direct read comparison]]> https://www.researchpad.co/article/Nef0c678a-8e44-48b4-b23e-6ad52ea03f7a Genetic variant calling with high-throughput sequencing data has been recognized as a useful tool for better understanding disease mechanisms and detecting potential off-target sites in genome editing. Since most variant calling algorithms rely on initial mapping onto a reference genome and tend to predict many variant candidates, variant calling remains challenging in terms of predicting variants with low false positives. Results: Here we present Bivartect, a simple yet versatile variant caller based on direct comparison of short sequence reads between normal and mutated samples. Bivartect can detect not only single nucleotide variants but also insertions/deletions, inversions and their complexes. Bivartect achieves high predictive performance with an elaborate memory-saving mechanism, which allows it to run on a single-node computer for analyzing small omics data. Tests with simulated benchmark and real genome-editing data indicate that Bivartect was comparable to state-of-the-art variant callers in positive predictive value for the detection of single nucleotide variants, even though it yielded a substantially smaller number of candidates. 
These results suggest that Bivartect, a reference-free approach, will contribute to the identification of germline mutations, as well as of off-target sites introduced during genome editing, with high accuracy. Availability and implementation: Bivartect is implemented in C++ and is available, along with in silico simulated data, at https://github.com/ykat0/bivartect. Supplementary information: Supplementary data are available at Bioinformatics online. ]]> <![CDATA[MODER2: first-order Markov modeling and discovery of monomeric and dimeric binding motifs]]> https://www.researchpad.co/article/N2b7a7074-1354-4430-9fc5-152fc1131146 Position-specific probability matrices (PPMs, also called position-specific weight matrices) have been the dominant model for transcription factor (TF)-binding motifs in DNA. There is, however, increasing recent evidence of better performance of higher-order models such as Markov models of order one, also called adjacent dinucleotide matrices (ADMs). ADMs can model dependencies between adjacent nucleotides, unlike PPMs. A modeling technique and software tool that would estimate such models simultaneously for both monomers and their dimers has been missing. Results: We present an ADM-based mixture model for monomeric and dimeric TF-binding motifs and an expectation maximization algorithm, MODER2, for learning such models from training data and seeds. The model is a mixture that includes monomers and dimers built from the monomers, with a description of the dimeric structure (spacing, orientation). The technique is modular, meaning that the co-operative effect of dimerization is made explicit by evaluating the difference between expected and observed models. The model is validated using HT-SELEX and generated datasets, and by comparison with some earlier PPM and ADM techniques. 
The ADM models explain the data slightly better than PPM models for the 314 tested TFs (or their DNA-binding domains) from four families (bHLH, bZIP, ETS and Homeodomain), with the ADM mixture models from MODER2 being the best on average. Availability and implementation: Software is available from https://github.com/jttoivon/moder2. Supplementary information: Supplementary data are available at Bioinformatics online. ]]> <![CDATA[LAMPA, LArge Multidomain Protein Annotator, and its application to RNA virus polyproteins]]> https://www.researchpad.co/article/N73e7f13f-c395-44d7-9bdc-13b11c06733e To facilitate accurate estimation of the statistical significance of sequence similarity in profile–profile searches, queries should ideally correspond to protein domains. For multidomain proteins, using domains as queries depends on the delineation of domain borders, which may be unknown. Thus, whole proteins are commonly used as queries, which complicates establishing homology for similarities close to the cutoff levels of statistical significance. Results: In this article, we describe an iterative approach, called LAMPA (LArge Multidomain Protein Annotator), that resolves the above conundrum by gradual expansion of the hit coverage of multidomain proteins, re-evaluating the statistical significance of hit similarity using ever smaller queries defined at each iteration. LAMPA employs TMHMM and HHsearch for the recognition of transmembrane regions and homology, respectively. We used the Pfam database for annotating 2985 multidomain proteins (polyproteins) composed of >1000 amino acid residues, which dominate the proteomes of RNA viruses. Under strict cutoffs, LAMPA outperformed HHsearch-mediated runs using intact polyproteins as queries by three measures: the number of and coverage by identified homologous regions, and the number of hit Pfam profiles. Compared to HHsearch, LAMPA identified 507 extra homologous regions in 14.4% of the polyproteins. 
This Pfam-based annotation of RNA virus polyproteins by LAMPA was also superior to RefSeq expert annotation by two measures, region number and annotated length, for 69.3% of RNA virus polyprotein entries. In computational experiments, we rationalized the obtained results in terms of how the statistical significance of an HHsearch hit's local alignment similarity score depends on the lengths and diversities of the query–target pairs. Availability and implementation: The LAMPA 1.0.0 R package is available at GitHub (https://github.com/Gorbalenya-Lab/LAMPA). Supplementary information: Supplementary data are available at Bioinformatics online. ]]> <![CDATA[A heuristic approach for detecting RNA H-type pseudoknots]]> https://www.researchpad.co/article/Nc1ada0ad-baf0-4264-aea3-28ea49d392a9 Motivation: RNA H-type pseudoknots are ubiquitous pseudoknots found in almost all classes of RNA and thought to play very important roles in a variety of biological processes. Detection of these RNA H-type pseudoknots can improve our understanding of RNA structures and their associated functions. However, the currently existing programs for detecting such RNA H-type pseudoknots are still time-consuming and sometimes even ineffective. Therefore, efficient and effective tools for detecting RNA H-type pseudoknots are needed.

Results: In this paper, we have adopted a heuristic approach to develop a novel tool, called HPknotter, for efficiently and accurately detecting H-type pseudoknots in an RNA sequence. In addition, we have demonstrated the applicability and effectiveness of HPknotter by testing it on sequences with known H-type pseudoknots. Our approach can be easily extended and applied to other classes of more general pseudoknots.
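The defining feature any H-type pseudoknot detector must recognize is a pair of crossing stems: bases in a hairpin loop pair with a downstream region, so the two stems interleave rather than nest. A minimal sketch of that crossing-pair test (this is only an illustration of the property, not HPknotter's algorithm; the stem coordinates are hypothetical):

```python
def has_crossing(pairs):
    # Two base pairs (i, j) and (k, l) with i < k cross when i < k < j < l;
    # crossing pairs are the defining feature of a pseudoknot.
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            (i, j), (k, l) = sorted([pairs[a], pairs[b]])
            if i < k < j < l:
                return True
    return False

# An H-type pseudoknot: stem 1 pairs the 5' region with part of the
# hairpin loop; stem 2 pairs the loop with a downstream 3' segment.
stem1 = [(0, 14), (1, 13), (2, 12)]   # hypothetical coordinates
stem2 = [(6, 22), (7, 21), (8, 20)]
print(has_crossing(stem1 + stem2))    # True: the two stems cross
print(has_crossing(stem1))            # False: a lone stem is nested
```

A full detector must of course also verify that the paired regions are complementary and energetically favourable, which is where the heuristics come in.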

Availability: The web server of our HPknotter is available for online analysis at http://bioalgorithm.life.nctu.edu.tw/HPKNOTTER/

Contact: cllu@mail.nctu.edu.tw, chiu@cc.nctu.edu.tw

]]>
<![CDATA[Mining SARS-CoV protease cleavage data using non-orthogonal decision trees: a novel method for decisive template selection]]> https://www.researchpad.co/article/N2042bd9f-55ef-4b5f-abfd-004f60140823 Motivation: Although the outbreak of severe acute respiratory syndrome (SARS) is currently over, it is expected that it will return to attack human beings. A critical challenge for scientists from various disciplines worldwide is to study the specificity of the cleavage activity of SARS-related coronavirus (SARS-CoV) and to use the knowledge obtained from this study for effective inhibitor design to fight the disease. The most commonly used inductive programming methods for knowledge discovery from data assume that the elements of the input patterns are orthogonal to each other. Suppose a sub-sequence is denoted P2P1P1′P2′; a conventional inductive programming method may result in a rule like ‘if P1 = Q, then the sub-sequence is cleaved, otherwise non-cleaved’. If the site P1 is not orthogonal to the others (for instance, P2, P1′ and P2′), the predictive power of these kinds of rules may be limited. Therefore, this study is aimed at developing a novel method for constructing non-orthogonal decision trees for mining protease data.

Result: Eighteen sequences of coronavirus polyprotein were downloaded from NCBI (http://www.ncbi.nlm.nih.gov). Among these sequences, 252 cleavage sites were experimentally determined. The sequences were scanned using a sliding window of size k to generate about 50 000 k-mer sub-sequences (k-mers, for short). The value of k varies from 4 to 12 in steps of two. The bio-basis function proposed by Thomson et al. is used to transform the k-mers into a high-dimensional numerical space, on which an inductive programming method is applied to derive a decision tree for decision-making. This transformation is referred to as bio-mapping. The constructed decision trees select about 10 out of the 50 000 k-mers. This small set of selected k-mers is regarded as a set of decisive templates. By doing so, non-orthogonal decision trees are constructed using the selected templates, and the prediction accuracy is significantly improved.
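The bio-mapping step replaces each k-mer by its similarities to a set of template k-mers, so that non-orthogonal dependencies between residue positions are captured by whole-subsequence comparison. A toy sketch of one common form of the bio-basis function (the identity-based scoring, the normalization by the template self-score, and the templates themselves are all assumptions for illustration; the original uses an amino-acid substitution matrix):

```python
import math

# Toy substitution score: +2 for a match, -1 for a mismatch (a real
# implementation would use a matrix such as BLOSUM62).
def score(a, b):
    return sum(2 if x == y else -1 for x, y in zip(a, b))

def bio_basis(x, template, gamma=1.0):
    # Similarity of k-mer x to a template, normalized by the template's
    # self-score and passed through an exponential; the exact
    # normalization here is an assumption, not the paper's formula.
    s_max = score(template, template)
    return math.exp(gamma * (score(x, template) - s_max) / s_max)

templates = ["QSAV", "TSAV"]   # hypothetical decisive templates
kmer = "QSAV"
features = [bio_basis(kmer, t) for t in templates]
print(features[0])             # 1.0: identical to the first template
```

The resulting feature vector (one similarity per template) is what the decision-tree learner consumes, which is why selecting a small set of decisive templates keeps the trees compact.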

Availability: The program for bio-mapping can be obtained by request to the author.

Contact: z.r.yang@exeter.ac.uk

]]>
<![CDATA[Object-oriented biological system integration: a SARS coronavirus example]]> https://www.researchpad.co/article/N07dc1ee1-1f7c-44c5-b6cd-ccd6c9bd30de Motivation: The importance of studying biology at the system level has been well recognized, yet there is no well-defined process or consistent methodology to integrate and represent biological information at this level. To overcome this hurdle, a blending of disciplines such as computer science and biology is necessary.

Results: By applying an adapted, sequential software engineering process, a complex biological system (severe acute respiratory syndrome coronavirus viral infection) has been reverse-engineered and represented as an object-oriented software system. The scalability of this object-oriented software engineering approach indicates that we can apply this technology to the integration of large, complex biological systems.
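The core idea of representing a biological system in object-oriented form is that biological entities become classes and their relationships become object composition. A minimal sketch of that modeling style (the class names, attributes and protein annotations below are illustrative inventions, not the paper's actual object model):

```python
# Biological entities as classes; a virus is composed of protein objects.
class Protein:
    def __init__(self, name, function):
        self.name = name
        self.function = function

class Virus:
    def __init__(self, name):
        self.name = name
        self.proteins = []

    def add_protein(self, protein):
        self.proteins.append(protein)

sars_cov = Virus("SARS-CoV")
sars_cov.add_protein(Protein("S", "receptor binding and membrane fusion"))
sars_cov.add_protein(Protein("3CLpro", "polyprotein cleavage"))
print([p.name for p in sars_cov.proteins])  # ['S', '3CLpro']
```

Encapsulating each entity behind a class interface is what makes the approach scalable: new entities and interactions extend the model without rewriting the existing classes.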

Availability: A navigable web-based version of the system is freely available at http://people.musc.edu/~zhengw/SARS/Software-Process.htm

Contact: zhengw@musc.edu

Supplementary information: Supplemental data: Table 1 and Figures 1–16.

]]>
<![CDATA[Using evolutionary Expectation Maximization to estimate indel rates]]> https://www.researchpad.co/article/N6e575b19-0cc3-49c9-8c92-853643743f63 Motivation: The Expectation Maximization (EM) algorithm, in the form of the Baum–Welch algorithm (for hidden Markov models) or the Inside-Outside algorithm (for stochastic context-free grammars), is a powerful way to estimate the parameters of stochastic grammars for biological sequence analysis. To use this algorithm for multiple-sequence evolutionary modelling, it would be useful to apply the EM algorithm to estimate not only the probability parameters of the stochastic grammar, but also the instantaneous mutation rates of the underlying evolutionary model (to facilitate the development of stochastic grammars based on phylogenetic trees, also known as Statistical Alignment). Recently, we showed how to do this for the point substitution component of the evolutionary process; here, we extend these results to the indel process.

Results: We present an algorithm for maximum-likelihood estimation of insertion and deletion rates from multiple sequence alignments, using EM, under the single-residue indel model due to Thorne, Kishino and Felsenstein (the ‘TKF91’ model). The algorithm converges extremely rapidly, gives accurate results on simulated data that are an improvement over parsimonious estimates (which are shown to underestimate the true indel rate), and gives plausible results on experimental data (coronavirus envelope domains). Owing to the algorithm's close similarity to the Baum–Welch algorithm for training hidden Markov models, it can be used in an ‘unsupervised’ fashion to estimate rates for unaligned sequences, or to estimate several sets of rates for sequences with heterogeneous rates.
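For reference, the TKF91 model referred to above treats insertions as a birth process at rate λ per inter-residue link and deletions as a death process at rate μ per residue, with λ < μ. A standard property of the model (not a result of this paper) is that the equilibrium sequence length is geometrically distributed:

```latex
% lambda = insertion rate per link, mu = deletion rate per residue, lambda < mu
P(L = n) \;=\; \left(1 - \frac{\lambda}{\mu}\right)\left(\frac{\lambda}{\mu}\right)^{n},
\qquad n = 0, 1, 2, \ldots
```

The EM procedure estimates λ and μ, and the constraint λ < μ is what keeps expected sequence lengths finite at equilibrium.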

Availability: Software implementing the algorithm and the benchmark is available under GPL from http://www.biowiki.org/

Contact: ihh@berkeley.edu

]]>
<![CDATA[UDSMProt: universal deep sequence models for protein classification]]> https://www.researchpad.co/article/N06ec7b02-5f84-40e3-9693-1aa1a2b9830a

Abstract

Motivation

Inferring the properties of a protein from its amino acid sequence is one of the key problems in bioinformatics. Most state-of-the-art approaches for protein classification are tailored to single classification tasks and rely on handcrafted features, such as position-specific scoring matrices from expensive database searches. We argue that this level of performance can be reached or even surpassed by learning a task-agnostic representation once, using self-supervised language modeling, and transferring it to specific tasks by a simple fine-tuning step.
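The pretrain-then-transfer recipe can be caricatured in a few lines: first learn next-residue statistics from unlabeled sequences (the self-supervised step), then let a downstream task consume the learned representation. The sketch below uses bigram counts as a stand-in for UDSMProt's neural language model; the sequences and the downstream "task" are invented for illustration:

```python
from collections import Counter
import math

# Self-supervised "pre-training": estimate next-residue statistics from
# unlabeled sequences (a toy stand-in for a neural language model).
unlabeled = ["MKVL", "MKIL", "MKVV", "GAVL"]
bigrams, firsts = Counter(), Counter()
for seq in unlabeled:
    for a, b in zip(seq, seq[1:]):
        bigrams[(a, b)] += 1
        firsts[a] += 1   # count of a as the first residue of a bigram

def log_likelihood(seq, alpha=1.0, vocab=20):
    # Average Laplace-smoothed bigram log-likelihood: a (very) compact
    # sequence representation learned without any labels.
    total = 0.0
    for a, b in zip(seq, seq[1:]):
        p = (bigrams[(a, b)] + alpha) / (firsts[a] + alpha * vocab)
        total += math.log(p)
    return total / (len(seq) - 1)

# "Transfer" stage: a downstream task consumes the learned score;
# here simply a comparison between a typical and an atypical sequence.
print(log_likelihood("MKVL") > log_likelihood("WWWW"))  # True
```

The same division of labour holds in the real method: the expensive representation is learned once on Swiss-Prot, and each classification task only adds a cheap fine-tuning step on top.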

Results

We put forward a universal deep sequence model that is pre-trained on unlabeled protein sequences from Swiss-Prot and fine-tuned on protein classification tasks. We apply it to three prototypical tasks, namely enzyme class prediction, gene ontology prediction and remote homology and fold detection. The proposed method performs on par with state-of-the-art algorithms that were tailored to these specific tasks or, for two out of three tasks, even outperforms them. These results stress the possibility of inferring protein properties from the sequence alone and, on more general grounds, the prospects of modern natural language processing methods in omics. Moreover, we illustrate the prospects for explainable machine learning methods in this field by selected case studies.

Availability and implementation

Source code is available under https://github.com/nstrodt/UDSMProt.

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[PaSiT: a novel approach based on short-oligonucleotide frequencies for efficient bacterial identification and typing]]> https://www.researchpad.co/article/N0a3e680c-f399-44d9-b477-5f250fd280f3

Abstract

Motivation

One of the most widespread methods used in taxonomy studies to distinguish between strains or taxa is the calculation of average nucleotide identity. It requires a computationally expensive alignment step and is therefore not suitable for large-scale comparisons. Short-oligonucleotide-based methods do offer a faster alternative, but at the expense of accuracy. Here, we aim to address this shortcoming by providing software that implements a novel method based on short-oligonucleotide frequencies to compute inter-genomic distances.
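The appeal of short-oligonucleotide methods is that each genome collapses to a fixed-length frequency vector, so comparing two genomes never requires alignment. A minimal sketch of the idea using tetranucleotide frequencies and a simple Manhattan distance (the actual PaSiT/GenDisCal distance differs; this only illustrates the alignment-free principle, and the sequences are toy examples):

```python
from itertools import product

def tetra_freqs(seq):
    # Frequency vector over all 256 tetranucleotides of the sequence.
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = {k: 0 for k in kmers}
    for i in range(len(seq) - 3):
        window = seq[i:i + 4]
        if window in counts:       # skip windows with ambiguous bases
            counts[window] += 1
    total = max(1, len(seq) - 3)
    return [counts[k] / total for k in kmers]

def manhattan(u, v):
    # A simple inter-genomic distance on the frequency vectors.
    return sum(abs(a - b) for a, b in zip(u, v))

a = tetra_freqs("ACGTACGTACGTACGT")
b = tetra_freqs("ACGTACGTACGTACGA")
print(manhattan(a, a))  # 0.0: identical sequences, zero distance
```

Because the vectors are computed in a single linear pass per genome, all-against-all comparisons of thousands of genomes stay tractable, which is exactly the large-scale use case the alignment step of average nucleotide identity rules out.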

Results

Our tetranucleotide and hexanucleotide implementations, which were optimized based on a taxonomically well-defined set of over 200 newly sequenced bacterial genomes, are as accurate as the short oligonucleotide-based method TETRA and average nucleotide identity, for identifying bacterial species and strains, respectively. Moreover, the lightweight nature of this method makes it applicable for large-scale analyses.

Availability and implementation

The method introduced here was implemented, together with other existing methods, in a dependency-free software written in C, GenDisCal, available as source code from https://github.com/LM-UGent/GenDisCal. The software supports multithreading and has been tested on Windows and Linux (CentOS). In addition, a Java-based graphical user interface that acts as a wrapper for the software is also available.

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[Causal network perturbations for instance-specific analysis of single cell and disease samples]]> https://www.researchpad.co/article/N1cc2695a-94a2-4308-a8ed-78dbffed8cb0

Abstract

Motivation

Complex diseases involve perturbations in multiple pathways, and a major challenge in clinical genomics is characterizing pathway perturbations in individual samples. This can lead to patient-specific identification of the underlying mechanism of disease, thereby improving diagnosis and personalizing treatment. Existing methods rely on external databases to quantify pathway activity scores. This ignores dependencies within the data, as well as the fact that pathway annotations are incomplete or condition-specific.

Results

ssNPA is a new approach for subtyping samples based on the deregulation of their gene networks. ssNPA learns a causal graph directly from control data. Sample-specific network-neighborhood deregulation is quantified via the error incurred in predicting the expression of each gene from its Markov blanket. We evaluate the performance of ssNPA on liver development single-cell RNA-seq data, where the correct cell timing is recovered, and on two TCGA datasets, where ssNPA patient clusters have significant survival differences. In all analyses, ssNPA consistently outperforms alternative methods, highlighting the advantage of network-based approaches.
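The deregulation score above can be illustrated with a drastically reduced sketch: train a per-gene predictive model on control samples, then score any new sample by that model's prediction error. Here one gene with a single-neighbor linear model stands in for the full causal-graph and Markov-blanket machinery, and all expression values are made up:

```python
# Control samples: neighbor expression and target-gene expression,
# roughly following gene = 2 * neighbor (hypothetical values).
control_neighbor = [1.0, 2.0, 3.0, 4.0]
control_gene     = [2.1, 3.9, 6.0, 8.1]

# Ordinary least squares fit of gene ~ neighbor on the control data.
n = len(control_neighbor)
mx = sum(control_neighbor) / n
my = sum(control_gene) / n
slope = (sum((x - mx) * (y - my)
             for x, y in zip(control_neighbor, control_gene))
         / sum((x - mx) ** 2 for x in control_neighbor))
intercept = my - slope * mx

def deregulation(neighbor_value, gene_value):
    # Prediction error of the control-trained model on one sample:
    # a large error suggests the gene's local network is deregulated.
    return abs(gene_value - (slope * neighbor_value + intercept))

# A sample that follows the control relationship scores lower than one
# whose gene expression departs from what its neighbor predicts.
print(deregulation(2.5, 5.0) < deregulation(2.5, 9.0))  # True
```

In the real method this is done for every gene against its full Markov blanket, and the per-gene error vector is what gets clustered to define patient subtypes.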

Availability and implementation

http://www.benoslab.pitt.edu/Software/ssnpa/.

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[InterPep2: global peptide–protein docking using interaction surface templates]]> https://www.researchpad.co/article/Nfef8a40d-4904-4616-af10-61a4337a5711

Abstract

Motivation

Interactions between proteins and peptides or peptide-like intrinsically disordered regions are involved in many important biological processes, such as gene expression and cell life-cycle regulation. Experimentally determining the structure of such interactions is time-consuming and difficult because of the inherent flexibility of the peptide ligand. Although several prediction methods exist, most are limited in performance or availability.

Results

InterPep2 is a freely available method for predicting the structure of peptide–protein interactions. Improved performance is obtained by using templates from both peptide–protein and regular protein–protein interactions, and by a random forest trained to predict the DockQ score for a given template using sequence and structural features. When tested on 252 bound peptide–protein complexes from structures deposited after the complexes used in the construction of the training and template sets of InterPep2, InterPep2-Refined correctly positioned 67 peptides within 4.0 Å LRMSD among the top 10, similar to another state-of-the-art template-based method, which positioned 54 peptides correctly. However, InterPep2 displays a superior ability to evaluate the quality of its own predictions. On a previously established set of 27 non-redundant unbound-to-bound peptide–protein complexes, InterPep2 performs on par with leading methods. The extended InterPep2-Refined protocol managed to correctly model 15 of these complexes within 4.0 Å LRMSD among the top 10, without using templates from homologs. In addition, combining the template-based predictions from InterPep2 with ab initio predictions from PIPER-FlexPepDock resulted in 22% more near-native predictions compared to the best single method (22 versus 18).

Availability and implementation

The program is available from: http://wallnerlab.org/InterPep2.

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[PheGWAS: a new dimension to visualize GWAS across multiple phenotypes]]> https://www.researchpad.co/article/N596deaae-a8ce-4fc4-9255-0a794300adb7

Abstract

Motivation

PheGWAS was developed to enhance exploration of phenome-wide pleiotropy at the genome-wide level through the efficient generation of a dynamic visualization combining Manhattan plots from GWAS with PheWAS to create a 3D ‘landscape’. Pleiotropy in sub-surface GWAS significance strata can be explored in a sectional view plotted within user defined levels. Further complexity reduction is achieved by confining to a single chromosomal section. Comprehensive genomic and phenomic coordinates can be displayed.

Results

PheGWAS is demonstrated using summary data from the Global Lipids Genetics Consortium GWAS across multiple lipid traits. For single and multiple traits, PheGWAS highlighted all 88 and 69 loci, respectively. Further, the genes and SNPs reported by the Global Lipids Genetics Consortium were identified using additional functions implemented within PheGWAS. Not only is PheGWAS capable of identifying independent signals, but it also provides insights into local genetic correlation (verified using HESS) and identifies potential regions that share causal variants across phenotypes (verified using colocalization tests).

Availability and implementation

The PheGWAS software and code are freely available at (https://github.com/georgeg0/PheGWAS).

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>