ResearchPad - methodology-article https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks <![CDATA[Transfer posterior error probability estimation for peptide identification]]> https://www.researchpad.co/article/elastic_article_12116

In shotgun proteomics, database searching of tandem mass spectra results in a great number of peptide-spectrum matches (PSMs), many of which are false positives. Quality control of PSMs is a multiple hypothesis testing problem, and the false discovery rate (FDR) or the posterior error probability (PEP) is the commonly used statistical confidence measure. PEP, also called local FDR, evaluates the confidence of individual PSMs and is thus more desirable than FDR, which evaluates the global confidence of a collection of PSMs. PEP can be estimated by decomposing the null and alternative distributions of PSM scores, provided the given data are sufficient. However, in many proteomic studies, only a group (subset) of PSMs, e.g., those with specific post-translational modifications, is of interest. The group can be very small, making direct PEP estimation from the group data inaccurate, especially in the high-score region where the score threshold is taken. Using the whole set of PSMs to estimate the group PEP is not appropriate either, because the null and/or alternative distributions of the group can be very different from those of the combined scores.

Results

The transfer PEP algorithm is proposed to more accurately estimate the PEPs of peptide identifications in small groups. Transfer PEP derives the group null distribution through its empirical relationship with the combined null distribution, and estimates the group alternative distribution, as well as the null proportion, using an iterative semi-parametric method.
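The PEP described above is the standard local-FDR decomposition: PEP(s) = π0·f0(s)/f(s), where f0 is the null density of scores, f the observed mixture density, and π0 the null proportion. The following sketch is illustrative only, not the Transfer PEP implementation: it uses kernel density estimates on synthetic target/decoy scores and assumes π0 is known.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Synthetic PSM scores: decoys approximate the null density f0, while
# targets are a pi0-mixture of null and alternative (correct) matches.
decoy_scores  = rng.normal(10, 2, 5000)                      # null f0
true_scores   = rng.normal(18, 3, 3000)                      # alternative f1
target_scores = np.concatenate([rng.normal(10, 2, 2000), true_scores])

pi0 = 2000 / len(target_scores)          # null proportion (known here)

f0 = gaussian_kde(decoy_scores)          # estimated null density
f  = gaussian_kde(target_scores)         # estimated mixture density

def pep(s):
    """Local FDR: probability that a PSM with score s is a false match."""
    return np.clip(pi0 * f0(s) / f(s), 0.0, 1.0)

# High-scoring PSMs should get low PEP, low-scoring ones high PEP.
print(pep(np.array([8.0, 20.0])))
```

The small-group problem the abstract raises is visible in this sketch: with only a few hundred group-specific scores, the kernel estimates of f0 and f become unreliable exactly in the high-score tail where thresholds are set.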
Validated on both simulated and real proteomic data, transfer PEP showed remarkably higher accuracy than the direct combined and separate PEP estimation methods.

Conclusions

We presented a novel approach to group PEP estimation for small groups and implemented it for the peptide identification problem in proteomics. The methodology of the approach is in principle applicable to small-group PEP estimation problems in other fields. ]]> <![CDATA[An automated aquatic rack system for rearing marine invertebrates]]> https://www.researchpad.co/article/elastic_article_12087

One hundred years ago, marine organisms were the dominant systems for the study of developmental biology. The challenges of rearing these organisms outside a marine setting ultimately contributed to a shift towards work on a smaller number of so-called model systems. Those animals are typically non-marine organisms with the advantages afforded by short life cycles, high fecundity, and relative ease of laboratory culture. However, a full understanding of biodiversity, evolution, and anthropogenic effects on biological systems requires a broader survey of development across the animal kingdom. To this day, marine organisms remain relatively understudied, particularly the members of the Lophotrochozoa (Spiralia), which include well over one third of the metazoan phyla (such as the annelids, mollusks, and flatworms) and exhibit a tremendous diversity of body plans and developmental modes. To facilitate studies of this group, we previously described the development and culture of one lophotrochozoan representative, the slipper snail Crepidula atrasolea, which is easy to rear in recirculating marine aquaria.
Lab-based culture and rearing of larger populations of animals remain a general challenge for many marine organisms, particularly for inland laboratories.

Results

Here, we describe the development of an automated marine aquatic rack system for the high-density culture of marine species, which is particularly well suited for rearing filter-feeding animals. Based on existing freshwater recirculating aquatic rack systems, our system is tailored to the needs of marine organisms and incorporates robust filtration measures to eliminate wastes, reducing the need for regular water changes. In addition, the system incorporates sensors and associated equipment for automated assessment and adjustment of water quality. An automated feeding system permits precise delivery of liquid food (e.g., phytoplankton) throughout the day, mimicking natural feeding conditions that contribute to increased growth rates and fecundity.

Conclusion

This automated system makes laboratory culture of marine animals feasible for both large and small research groups, significantly reducing the time, labor, and overall costs needed to rear these organisms. ]]> <![CDATA[Detecting PCOS susceptibility loci from genome-wide association studies via iterative trend correlation based feature screening]]> https://www.researchpad.co/article/elastic_article_12077

Feature screening plays a critical role in ultrahigh-dimensional data analyses, where the number of features exponentially exceeds the number of observations. It is increasingly common in biomedical research to have a case-control (binary) response and extremely large-scale categorical features. However, approaches that consider such data types are limited in the extant literature.
In this article, we propose a new feature screening approach based on iterative trend correlation (ITC-SIS, for short) to detect important susceptibility loci associated with polycystic ovary syndrome (PCOS) affection status by screening 731,442 SNP features collected from genome-wide association studies.

Results

We prove that the trend correlation based screening approach satisfies the theoretical strong screening consistency property under a set of reasonable conditions, which provides appealing theoretical support for its performance. We demonstrate through various simulation designs that the finite-sample performance of ITC-SIS is accurate and fast.

Conclusion

ITC-SIS serves as a good alternative method for detecting disease susceptibility loci in clinical genomic data. ]]> <![CDATA[Power analysis for RNA-Seq differential expression studies using generalized linear mixed effects models]]> https://www.researchpad.co/article/elastic_article_9713

Power analysis is an essential step in the experimental design of current biomedical research. Complex designs allowing diverse correlation structures are commonly used in RNA-Seq experiments. However, the field currently lacks statistical methods to calculate sample size and estimate power for RNA-Seq differential expression studies using such designs. To fill this gap, simulation-based methods have a great advantage in providing numerical solutions, since theoretical distributions of test statistics are typically unavailable for such designs.

Results

In this paper, we propose a novel simulation-based procedure for power estimation of differential expression with the employment of generalized linear mixed effects models for correlated expression data. We also propose a new procedure for power estimation of differential expression with the use of a bivariate negative binomial distribution for paired designs.
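The general simulation-based recipe is: repeatedly generate counts under the assumed design, run the test of interest, and report the rejection fraction as the power estimate. A minimal sketch, assuming negative binomial counts parameterised by mean and dispersion and substituting a paired t-test on log-counts for the actual GLMM fit:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulate_power(n_pairs=30, fold_change=2.0, mean=100, dispersion=0.2,
                   alpha=0.05, n_sim=500):
    """Estimate power by simulation: draw negative binomial counts for a
    two-condition design and count how often a paired test on log-counts
    rejects at level alpha.  (A stand-in for the GLMM-based test.)"""
    def nb(m, size):
        # Parameterize NB by mean m and dispersion phi: var = m + phi*m^2.
        r = 1.0 / dispersion
        return rng.negative_binomial(r, r / (r + m), size)

    rejections = 0
    for _ in range(n_sim):
        control = nb(mean, n_pairs)
        treated = nb(mean * fold_change, n_pairs)
        # Paired t-test on log-transformed counts (pseudo-count of 1).
        _, pval = stats.ttest_rel(np.log(treated + 1), np.log(control + 1))
        rejections += pval < alpha
    return rejections / n_sim

print(simulate_power())                  # power at a 2-fold change
print(simulate_power(fold_change=1.0))   # under the null: close to alpha
```

Running the null case (fold change of 1) is the check the abstract describes: the rejection fraction should sit near the nominal level if false positives are properly controlled.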
We compare the performance of both the likelihood ratio test and the Wald test under a variety of simulation scenarios with the proposed procedures. The simulated distribution was used to estimate the null distribution of the test statistics in order to achieve the desired false positive control, and was compared to the asymptotic chi-square distribution. In addition, we applied the procedure for paired designs to the TCGA breast cancer data set.

Conclusions

In summary, we provide a framework for power estimation of RNA-Seq differential expression under complex experimental designs. Simulation results demonstrate that both proposed procedures properly control the false positive rate at the nominal level. ]]> <![CDATA[PretiMeth: precise prediction models for DNA methylation based on single methylation mark]]> https://www.researchpad.co/article/elastic_article_9105

Computational prediction of methylation levels at single-CpG resolution is a promising way to explore the methylation levels of CpGs not covered by existing array techniques, especially given the huge reserves of 450K BeadChip array data. General prediction models concentrate on improving the overall prediction accuracy for the bulk of CpG loci while neglecting whether each individual locus is precisely predicted. This limits the application of the prediction results, especially when performing downstream analyses with high precision requirements.

Results

Here we report PretiMeth, a method for constructing a precise prediction model for each single CpG locus. PretiMeth uses a logistic regression algorithm to build a prediction model for each locus of interest. Only the one DNA methylation feature that shares the most similar methylation pattern with the CpG locus to be predicted is used in the model. We found that PretiMeth outperformed other algorithms in prediction accuracy and remained robust across platforms and cell types.
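The single-feature idea can be illustrated with a toy logistic fit: one predictor CpG, one logistic link, one locus-specific model. This sketch invents the data and uses a generic curve fit; it is not the PretiMeth code.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(2)

def sigmoid(x, a, b):
    """Logistic link mapping the predictor CpG's level to the target's."""
    return 1.0 / (1.0 + np.exp(-(a * x + b)))

# Toy data: across samples, the target CpG's beta value tracks the single
# most similar CpG through a logistic relationship, plus noise.
predictor = rng.uniform(0, 1, 200)                      # most similar CpG
target = sigmoid(predictor, 6.0, -3.0) + rng.normal(0, 0.02, 200)
target = np.clip(target, 0, 1)                          # beta values in [0, 1]

# Fit a one-feature logistic model for this single locus.
(a_hat, b_hat), _ = curve_fit(sigmoid, predictor, target, p0=[1.0, 0.0])

pred = sigmoid(predictor, a_hat, b_hat)
rmse = np.sqrt(np.mean((pred - target) ** 2))
print(f"a={a_hat:.2f}, b={b_hat:.2f}, RMSE={rmse:.3f}")
```

One such tiny model per locus is what makes the per-locus precision claim checkable: each model's residual error can be inspected individually rather than averaged over the whole array.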
Furthermore, PretiMeth was applied to The Cancer Genome Atlas (TCGA) data; intensive analysis based on the precise prediction results showed that several CpG loci and genes (differentially methylated between tumor and normal samples) are worthy of further biological validation.

Conclusion

Precise prediction of single CpG loci is important for both methylation array data expansion and downstream analysis of prediction results. PretiMeth achieves precise modeling for each CpG locus by using only one significant feature, which also suggests that our precise prediction models could serve as a reference for probe set design when the DNA methylation BeadChip is updated. PretiMeth is provided as an open source tool via https://github.com/JxTang-bioinformatics/PretiMeth. ]]> <![CDATA[CSN: unsupervised approach for inferring biological networks based on the genome alone]]> https://www.researchpad.co/article/elastic_article_8972

Most organisms cannot be cultivated, as they live in unique ecological conditions that cannot be mimicked in the lab. Understanding the functionality of those organisms’ genes and their interactions by performing large-scale measurements of transcription levels, protein-protein interactions, or metabolism is extremely difficult and, in some cases, impossible. Thus, efficient algorithms are needed for deciphering genome functionality based only on genomic sequences, with no other experimental measurements.

Results

In this study, we describe a novel algorithm that infers gene networks, which we name the Common Substring Network (CSN). The algorithm enables inferring novel regulatory relations among genes based only on the genomic sequence of a given organism and partial homolog/ortholog-based functional annotation.
In particular, it can infer the functional annotation of genes with unknown homology. This approach is based on the assumption that related genes, not necessarily homologs, tend to share sub-sequences, which may be related to common regulatory mechanisms, similar functionality of encoded proteins, common evolutionary history, and more. We demonstrate that CSNs based on the S. cerevisiae and E. coli genomes have properties similar to ‘traditional’ biological networks inferred from experiments: highly expressed genes tend to have higher-degree nodes in the CSN, genes with similar protein functionality tend to be closer, and the CSN graph exhibits a power-law degree distribution. We also show how the CSN can be used to predict gene interactions and functions.

Conclusions

The reported results suggest that ‘silent’ code inside the transcript can help to predict central features of biological networks and gene function. This approach can help researchers understand the genomes of novel microorganisms, analyze metagenomic data, and decipher new gene functions.

Availability

Our MATLAB implementation of CSN is available at https://www.cs.tau.ac.il/~tamirtul/CSN-Autogen ]]> <![CDATA[Development of a surgical procedure for removal of a placentome from a pregnant ewe during gestation]]> https://www.researchpad.co/article/elastic_article_8495

In recent decades, there has been growing interest in the impact of insults during pregnancy on postnatal health and disease. It is known that changes in placental development can impact fetal growth and subsequent susceptibility to adult-onset diseases; however, a method to collect sufficient placental tissue for both histological and gene expression analyses during gestation, without compromising the pregnancy, has not been described. The ewe is an established biomedical model for the study of fetal development.
Due to its cotyledonary placental type, the sheep has potential for surgical removal of materno-fetal exchange tissues, i.e., placentomes. A novel surgical procedure was developed in well-fed control ewes to excise a single placentome at mid-gestation.

Results

A follow-up study was performed in a cohort of nutrient-restricted ewes to investigate rapid placental changes in response to undernutrition. The surgery averaged 19 min, and there were no differences in viability between control and sham ewes. Nutrient-restricted fetuses were smaller than controls (4.7 ± 0.1 kg vs. 5.6 ± 0.2 kg; P < 0.05), with greater dam weight loss (− 32.4 ± 1.3 kg vs. 14.2 ± 2.2 kg; P < 0.01) and smaller placentomes at necropsy (5.7 ± 0.3 g vs. 7.2 ± 0.9 g; P < 0.05). Weight of sampled placentomes and placentome numbers did not differ.

Conclusions

With this technique, gestational studies in the sheep model will provide insight into the onset and complexity of changes in placentome gene expression resulting from undernutrition (as described in our study), overnutrition, alcohol or substance abuse, and environmental or disease factors relevant to reproductive health and the developmental origins of health and disease in humans and animals. ]]> <![CDATA[A random forest based computational model for predicting novel lncRNA-disease associations]]> https://www.researchpad.co/article/Neaf3fca6-41a2-4cac-978b-5db6a45f1097

Background

Accumulating evidence shows that abnormal regulation of long non-coding RNAs (lncRNAs) is associated with various human diseases. Accurately identifying disease-associated lncRNAs helps in studying the mechanisms of lncRNAs in disease and in exploring new therapies. Many lncRNA-disease association (LDA) prediction models have been implemented by integrating multiple kinds of data resources. However, most existing models ignore the interference of noisy and redundant information among these data resources.

Results

To improve the performance of LDA prediction models, we implemented a random forest and feature selection based LDA prediction model (RFLDA for short). First, the RFLDA integrates experimentally supported miRNA-disease associations (MDAs) and LDAs, disease semantic similarity (DSS), lncRNA functional similarity (LFS), and lncRNA-miRNA interactions (LMI) as input features. Then, the RFLDA selects the most useful features for training the prediction model by feature selection based on the random forest variable importance score, which accounts not only for the effect of each individual feature on the prediction results but also for the joint effects of multiple features. Finally, a random forest regression model is trained to score potential lncRNA-disease associations. With an area under the receiver operating characteristic curve (AUC) of 0.976 and an area under the precision-recall curve (AUPR) of 0.779 under 5-fold cross-validation, the RFLDA outperforms several state-of-the-art LDA prediction models. Moreover, case studies on three cancers demonstrate that 43 of the 45 lncRNAs predicted by the RFLDA are validated by experimental data, and the other two predicted lncRNAs are supported by other LDA prediction models.
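The two-stage design described above, importance-based feature selection followed by a regression forest whose continuous output scores candidate pairs, can be sketched as follows. The data, feature layout, and parameter choices are invented for illustration and do not reproduce the RFLDA features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Toy stand-in: each lncRNA-disease pair is a feature vector (similarities,
# interaction profiles); the label is 1 for known associations, 0 otherwise.
n_pairs, n_features = 500, 20
X = rng.normal(size=(n_pairs, n_features))
# Only the first three features actually carry signal.
y = ((X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(0, 0.5, n_pairs)) > 0).astype(float)

# Step 1: rank features by random-forest variable importance.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
ranked = np.argsort(rf.feature_importances_)[::-1]

# Step 2: retrain a regression forest on the top-k features and use its
# continuous output to score candidate associations.
top_k = ranked[:5]
scorer = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[:, top_k], y)
scores = scorer.predict(X[:, top_k])

print("highest-importance features:", sorted(int(i) for i in ranked[:5]))
```

Using a regression forest rather than a classifier is what yields a continuous association score that can be thresholded or ranked, which is how AUC/AUPR are computed.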

Conclusions

Cross-validation and case studies indicate that the RFLDA has excellent ability to identify potential disease-associated lncRNAs.

]]>
<![CDATA[Novel approach in whole genome mining and transcriptome analysis reveal conserved RiPPs in Trichoderma spp]]> https://www.researchpad.co/article/N07e4cbdc-b3ae-49a3-9bb6-563996407d27

Background

Ribosomally synthesized and post-translationally modified peptides (RiPPs) are a highly diverse group of secondary metabolites (SM) of bacterial and fungal origin. While RiPPs have been intensively studied in bacteria, little is known about fungal RiPPs. In fungi, only six classes of RiPPs have been described, and current strategies for genome mining are based on these six known classes. However, the genes involved in the biosynthesis of these RiPPs are normally organized in biosynthetic gene clusters (BGCs) in fungi.

Results

Here we describe a comprehensive strategy to mine fungal genomes for RiPPs by combining and adapting existing tools (e.g., antiSMASH and RiPPMiner), followed by extensive manual curation based on conserved domain identification, (comparative) phylogenetic analysis, and RNA-Seq data. Deploying this strategy, we successfully rediscovered already known fungal RiPPs. We then analysed four fungal genomes from the genus Trichoderma and were able to find novel potential RiPP BGCs using our unconventional mining approach.

Conclusion

We demonstrate that this unusual mining approach, using tools developed for bacteria, can be applied to fungi when carefully curated. Our study is the first report of the potential of Trichoderma to produce RiPPs; the detected clusters encode novel, uncharacterized RiPPs. The method described in our study should lead to further mining efforts across all subdivisions of the fungal kingdom.

]]>
<![CDATA[Exploiting sequence labeling framework to extract document-level relations from biomedical texts]]> https://www.researchpad.co/article/N90579c7d-574e-4b8c-9c6a-064d86b2aa90

Background

Both intra- and inter-sentential semantic relations in biomedical texts provide valuable information for biomedical research. However, most existing methods either focus on extracting intra-sentential relations while ignoring inter-sentential ones, or fail to extract inter-sentential relations accurately because they regard instances containing entity relations as independent, neglecting the interactions between relations. We propose a novel sequence labeling-based biomedical relation extraction method named Bio-Seq. In this method, the sequence labeling framework is extended with multiple specialized feature extractors to facilitate feature extraction at different levels, especially the inter-sentential level. Moreover, the sequence labeling framework enables Bio-Seq to take advantage of the interactions between relations, further improving the precision of document-level relation extraction.
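The core move of casting relation extraction as sequence labeling can be illustrated generically: fix one head entity, then tag the tokens of the document so that related mentions carry the relation label. The tokens, spans, and tag scheme below are invented for illustration and are not the authors' exact Bio-Seq encoding:

```python
# Toy document with one cross-sentence chemical-disease relation.
tokens = ["Aspirin", "was", "given", ";", "later", ",", "asthma", "developed", "."]
# Gold annotation: (head_entity, tail_token_span, relation_type).
relations = [("Aspirin", (6, 7), "CID")]   # chemical-induced disease

def to_tags(tokens, head, relations):
    """Produce one BIO tag sequence per head entity: tokens belonging to a
    related tail mention are tagged with the relation type, others with O."""
    tags = ["O"] * len(tokens)
    for h, (start, end), rel in relations:
        if h != head:
            continue
        tags[start] = f"B-{rel}"
        for i in range(start + 1, end):
            tags[i] = f"I-{rel}"
    return tags

print(to_tags(tokens, "Aspirin", relations))
```

Once relations are encoded this way, a single tagger sees all tail mentions for a head entity at once, which is how the framework captures interactions between relations instead of classifying each candidate pair independently.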

Results

Our proposed method obtained an F1-score of 63.5% on the BioCreative V chemical-disease relation corpus, and an F1-score of 54.4% on inter-sentential relations, 10.5% better than the document-level classification baseline. Our method also achieved an F1-score of 85.1% on the n2c2-ADE sub-dataset.

Conclusion

The sequence labeling method can be successfully used to extract document-level relations, and in particular to boost performance on inter-sentential relation extraction. Our work can facilitate research on document-level biomedical text mining.

]]>
<![CDATA[Identification and quantification of virulence factors of enterotoxigenic Escherichia coli by high-resolution melting curve quantitative PCR]]> https://www.researchpad.co/article/5989db5dab0ee8fa60be051c

Background

Diagnosis of enterotoxigenic E. coli (ETEC)-associated diarrhea is complicated by the diversity of E. coli virulence factors. This study developed a multiplex quantitative PCR assay based on high-resolution melting curve analysis (HRM-qPCR) to identify and quantify genes encoding five ETEC fimbriae related to diarrhea in swine, i.e., K99, F41, F18, F6, and K88.

Methods

Genes encoding the five ETEC fimbriae were amplified in multiplex HRM-qPCR reactions to allow simultaneous identification and quantification of the five targets. The assay was calibrated to allow quantification of the most abundant target gene, and was validated by analysis of 30 samples obtained from piglets with diarrhea and healthy controls, and by comparison to standard qPCR detection.

Results

The five amplicons, with melting temperatures (Tm) ranging from 74.7 ± 0.06 to 80.5 ± 0.15 °C, were well separated by HRM-qPCR. The amplicon area under the melting peak correlated linearly with the proportion of the template in the calibration mixture when that proportion exceeded 4.8% (K88) or 1% (all other amplicons). The suitability of the method was evaluated using 30 samples from weaned pigs aged 6–7 weeks; 14 of these animals suffered from diarrhea as a consequence of poor sanitary conditions. Genes encoding fimbriae and enterotoxins were quantified by HRM-qPCR and/or qPCR. The multiplex HRM-qPCR allowed accurate analysis when the total gene copy number of targets exceeded 1 × 10⁵ per gram of wet feces, and the HRM curves were able to simultaneously distinguish fimbriae genes in the fecal samples. The relative quantification of the most abundant fimbria, F18, based on melting peak area was highly correlated (P < 0.001; r² = 0.956) with the individual qPCR result, but the correlation for less abundant fimbriae was much lower.
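Relative quantification from melting peak areas, as used above for F18, amounts to integrating each amplicon's peak in the −dF/dT curve and normalising by the total. A toy sketch with synthetic Gaussian peaks; the temperatures, widths, and abundances are invented, not the assay's calibration values:

```python
import numpy as np

# Synthetic -dF/dT melt curve with two amplicon peaks (nominally "F18" and
# "K88"); all numbers here are illustrative.
temps = np.linspace(70, 85, 1501)

def peak(tm, height, width=0.4):
    """Gaussian melt peak centred at Tm."""
    return height * np.exp(-0.5 * ((temps - tm) / width) ** 2)

curve = peak(78.4, 3.0) + peak(80.5, 1.0)   # "F18" three times as abundant

def peak_area(curve, temps, tm, window=1.5):
    """Integrate the melt peak within +/- window of its Tm (Riemann sum)."""
    mask = np.abs(temps - tm) <= window
    return curve[mask].sum() * (temps[1] - temps[0])

areas = {"F18": peak_area(curve, temps, 78.4), "K88": peak_area(curve, temps, 80.5)}
total = sum(areas.values())
proportions = {k: v / total for k, v in areas.items()}
print(proportions)
```

The sketch also shows why well-separated Tm values matter: when peaks overlap, the fixed integration windows assign some of one amplicon's area to its neighbour, which degrades quantification of the less abundant target.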

Conclusions

The multiplex HRM assay identifies ETEC virulence factors specifically and efficiently. It correctly indicated the predominant fimbriae type, additionally provides information on the presence or absence of other fimbriae types, and could find broad application in pathogen diagnosis.

]]>
<![CDATA[Enhancing fragment-based protein structure prediction by customising fragment cardinality according to local secondary structure]]> https://www.researchpad.co/article/N75a9123b-8746-4344-a4f9-21892d6bd0fa

Background

Whenever suitable template structures are not available, fragment-based protein structure prediction becomes the only practical alternative, as pure ab initio techniques require massive computational resources even for very small proteins. However, the inaccuracy of their energy functions and their stochastic nature impose the generation of a large number of decoys to explore the solution space adequately, limiting their usage to small proteins. Taking advantage of the uneven complexity of the sequence-structure relationship of short fragments, we adjusted the fragment insertion process by customising the number of available fragment templates according to the expected complexity of the predicted local secondary structure: whereas the number of fragments is kept at its default value for coil regions, important and dramatic reductions are proposed for beta sheet and alpha helical regions, respectively.
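The customisation itself reduces to a per-position lookup from predicted secondary structure to fragment pool size. In this sketch the counts are placeholders: the paper proposes reductions for strand and helix regions, but these specific numbers are our own illustration.

```python
# Fragment pool size per predicted secondary-structure state
# (C = coil, E = strand, H = helix).  Values are illustrative only:
# the default is kept for coils, with strong reductions for E and H.
CARDINALITY = {"C": 200, "E": 50, "H": 10}

def fragment_counts(ss_prediction: str) -> list[int]:
    """Map a predicted secondary-structure string (H/E/C) to the number of
    candidate fragments to use at each insertion position."""
    return [CARDINALITY[ss] for ss in ss_prediction]

print(fragment_counts("CCHHHHHCCEEEEC"))
```

The effect is to concentrate the stochastic search budget on coil regions, where the sequence-structure relationship is hardest, which is the hypothesis the Results section tests.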

Results

Our fragment selection approach was evaluated using an enhanced version of the popular Rosetta fragment-based protein structure prediction tool, modified so that the number of fragment candidates used by Rosetta could be adjusted based on the local secondary structure. Compared to Rosetta's standard predictions, our strategy delivered improved first models, +24% and +6% in terms of GDT when using 2000 and 20,000 decoys, respectively, while significantly reducing the number of fragment candidates. Furthermore, our enhanced version of Rosetta is able to deliver, with 2000 decoys, performance equivalent to that produced by standard Rosetta using 20,000 decoys. We hypothesise that, as the fragment insertion process focuses on the most challenging regions, such as coils, fewer decoys are needed to explore the conformation space satisfactorily.

Conclusions

Taking advantage of the high accuracy of sequence-based secondary structure predictions, we showed the value of this information for customising the number of candidates used during the fragment insertion process of fragment-based protein structure prediction. Experiments conducted with standard Rosetta showed that, when using the recommended number of decoys, i.e. 20,000, our strategy produces better results. Alternatively, similar results can be achieved using only 2000 decoys. Consequently, we recommend adopting this strategy either to improve model quality significantly or to reduce processing times by a factor of 10.

]]>
<![CDATA[Hierarchical discovery of large-scale and focal copy number alterations in low-coverage cancer genomes]]> https://www.researchpad.co/article/N1daffb36-6c96-4b42-9188-cf0e57891150

Background

Detection of DNA copy number alterations (CNAs) is critical to understanding genetic diversity, genome evolution, and pathological conditions such as cancer. Cancer genomes are plagued with widespread multi-level structural aberrations of chromosomes, which make it challenging to discover CNAs of different length scales and of distinct biological origins and functions. Although several computational tools are available to identify CNAs using the read depth (RD) signal, they fail to distinguish between large-scale and focal alterations due to inaccurate modeling of the RD signal of cancer genomes. Additionally, the RD signal is affected by overdispersion-driven biases at low coverage, which significantly inflate the false detection of CNA regions.

Results

We have developed the CNAtra framework to hierarchically discover and classify ‘large-scale’ and ‘focal’ copy number gains/losses from a single whole-genome sequencing (WGS) sample. CNAtra first utilizes a multimodal distribution to estimate the copy number (CN) reference from the complex RD profile of the cancer genome. We implemented a Savitzky-Golay smoothing filter and Modified Varri segmentation to capture the change points of the RD signal, and then developed a CN state-driven merging algorithm to identify large segments with distinct copy numbers. Next, we identified focal alterations in each large segment using coverage-based thresholding to mitigate the adverse effects of signal variations. Using cancer cell lines and patient datasets, we confirmed CNAtra's ability to detect and distinguish segmental aneuploidies and focal alterations. We benchmarked CNAtra against other single-sample detection tools on realistic simulated data, in which we artificially introduced CNAs into the original cancer profiles, and found that CNAtra is superior in terms of precision, recall, and F-measure. CNAtra shows the highest sensitivities, 93% and 97%, for detecting large-scale and focal alterations, respectively. Visual inspection of CNAs revealed that CNAtra is the most robust detection tool for low-coverage cancer data.
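The smoothing-then-segmentation step can be sketched on a synthetic read-depth track. A Savitzky-Golay filter (as named above) smooths the signal, and a deliberately crude largest-jump rule stands in for the Modified Varri segmentation; the window length and segment layout are illustrative choices:

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(4)

# Synthetic read-depth (RD) signal: three segments at mean coverages 20, 40
# and 10 with Poisson noise -- a stand-in for a binned WGS coverage track.
rd = np.concatenate([rng.poisson(lam, 300) for lam in (20, 40, 10)]).astype(float)

# Smooth the noisy RD signal with a Savitzky-Golay filter before looking
# for change points.
smooth = savgol_filter(rd, window_length=51, polyorder=3)

# Crude change-point capture: the largest jump of the smoothed signal in
# each half of the track (the real pipeline uses Modified Varri segmentation
# followed by CN state-driven merging).
d = np.abs(np.diff(smooth))
b1 = int(np.argmax(d[:450]))
b2 = 450 + int(np.argmax(d[450:]))
print(b1, b2)   # candidate breakpoints near the true boundaries at 300 and 600
```

Smoothing first is what keeps the overdispersed low-coverage noise from producing spurious jumps, which is the failure mode the Background paragraph attributes to existing RD-based tools.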

Conclusions

CNAtra is a single-sample CNA detection tool that provides an analytical and visualization framework for CNA profiling without relying on any reference control. It can detect chromosome-level segmental aneuploidies and high-confidence focal alterations, even from low-coverage data. CNAtra is an open-source software implemented in MATLAB®. It is freely available at https://github.com/AISKhalil/CNAtra.

]]>
<![CDATA[A unified nomenclature for vertebrate olfactory receptors]]> https://www.researchpad.co/article/N7c983842-6067-4a67-a4d0-373a2ee21b2d

Background

Olfactory receptors (ORs) are G protein-coupled receptors with a crucial role in odor detection. A typical mammalian genome harbors ~ 1000 OR genes and pseudogenes; however, different gene duplication/deletion events have occurred in each species, resulting in complex orthology relationships. While the human OR nomenclature is widely accepted and based on phylogenetic classification into 18 families and further into subfamilies, for other mammals different and multiple nomenclature systems are currently in use, thus concealing important evolutionary and functional insights.

Results

Here, we describe the Mutual Maximum Similarity (MMS) algorithm, a systematic classifier for assigning a human-centric nomenclature to any OR gene based on inter-species hierarchical pairwise similarities. MMS was applied to the OR repertoires of seven mammals and zebrafish. Altogether, we assigned symbols to 10,249 ORs. This nomenclature is supported by both phylogenetic and synteny analyses. The availability of a unified nomenclature provides a framework for diverse studies, where textual symbol comparison allows immediate identification of potential ortholog groups as well as species-specific expansions/deletions; for example, Or52e5 and Or52e5b represent a rat-specific duplication of OR52E5. Another example is the complete absence of OR subfamily OR6Z among primate OR symbols. In other mammals, OR6Z members are located in one genomic cluster, suggesting a large deletion in the great ape lineage. An additional 14 mammalian OR subfamilies are missing from the primate genomes. While in chimpanzee 87% of the symbols were identical to human symbols, this number decreased to ~ 50% in dog and cow and to ~ 30% in rodents, reflecting the adaptive changes of the OR gene superfamily across diverse ecological niches. Application of the proposed nomenclature to zebrafish revealed similarity to mammalian ORs that could not be detected from the current zebrafish olfactory receptor gene nomenclature.
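The reciprocal-best-match flavour of the MMS assignment can be illustrated with a toy similarity table. The gene names and similarity values are invented; in this toy data, rat_gene2's best human hit is already claimed by rat_gene1, which is the situation that, in the real nomenclature, yields a suffixed symbol such as Or52e5b.

```python
# Toy sketch of the mutual-maximum-similarity idea: a species OR gene takes
# the symbol of the human OR to which it is the reciprocal best match.
# (Real MMS uses inter-species hierarchical pairwise sequence similarities.)
similarity = {
    ("rat_gene1", "OR52E5"): 91.0,
    ("rat_gene1", "OR52E4"): 78.5,
    ("rat_gene2", "OR52E5"): 89.0,
    ("rat_gene2", "OR1A1"): 62.0,
}

def mutual_best_pairs(sim):
    """Return (species_gene, human_gene) pairs that are each other's best hit."""
    species = {g for g, _ in sim}
    human = {h for _, h in sim}
    best_for_species = {g: max(human, key=lambda h: sim.get((g, h), -1.0)) for g in species}
    best_for_human = {h: max(species, key=lambda g: sim.get((g, h), -1.0)) for h in human}
    return {(g, h) for g, h in best_for_species.items() if best_for_human[h] == g}

print(sorted(mutual_best_pairs(similarity)))
```

Here only rat_gene1 is mutually paired with OR52E5, so it would inherit that symbol, while the remaining gene would need a derived symbol, mirroring the Or52e5/Or52e5b duplication example in the text.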

Conclusions

We have consolidated a unified standard nomenclature system for the vertebrate OR superfamily. The new nomenclature system will be applied to cow, horse, dog and chimpanzee by the Vertebrate Gene Nomenclature Committee and its implementation is currently under consideration by other relevant species-specific nomenclature committees.

]]>
<![CDATA[PathME: pathway based multi-modal sparse autoencoders for clustering of patient-level multi-omics data]]> https://www.researchpad.co/article/N9d4d3970-275f-4921-8d8b-416215db7905

Background

Recent years have witnessed increasing interest in multi-omics data, because these data allow for a better understanding of complex diseases such as cancer at the molecular system level. In addition, multi-omics data increase the chance of robustly identifying molecular patient subgroups and hence open the door towards better personalized treatment of diseases. Several methods have been proposed for unsupervised clustering of multi-omics data. However, a number of challenges remain, such as the sheer number of features and the large difference in dimensionality across different omics data sources.

Results

We propose a multi-modal sparse denoising autoencoder framework coupled with sparse non-negative matrix factorization to robustly cluster patients based on multi-omics data. The proposed model specifically leverages pathway information to effectively reduce the dimensionality of omics data into a pathway- and patient-specific score profile. As a consequence, our method allows us to understand which pathway is a feature of which particular patient cluster. Moreover, recently proposed machine learning techniques allow us to disentangle the specific impact of each individual omics feature on a pathway score. We applied our method to cluster patients in several cancer datasets using gene expression, miRNA expression, DNA methylation, and CNVs, demonstrating that it is possible to obtain biologically plausible disease subtypes characterized by specific molecular features. Comparison against several competing methods showed competitive clustering performance. In addition, post-hoc analysis of somatic mutations and clinical data provided supporting evidence for, and interpretation of, the identified clusters.
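The final clustering step, factorising a non-negative patient score matrix and grouping patients by their loadings, can be sketched with plain NMF plus k-means on planted data. This omits the sparsity penalty and the autoencoder-derived pathway scores; all data and parameters are illustrative.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)

def block(m, n, level):
    """Non-negative block with the given mean level."""
    return np.abs(rng.normal(level, 0.3, (m, n)))

# Planted structure: patients 0-29 load on the first "pathway" block,
# patients 30-59 on the second (stand-in for pathway score profiles).
X = np.vstack([np.hstack([block(30, 5, 4.0), block(30, 5, 0.5)]),
               np.hstack([block(30, 5, 0.5), block(30, 5, 4.0)])])

# Factorise X ~ W @ H; rows of W are per-patient loadings on the factors.
W = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500).fit_transform(X)

# Cluster patients on their factor loadings.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(W)
print(labels)
```

Clustering on the low-dimensional loadings rather than the raw concatenated omics matrix is what sidesteps the dimensionality imbalance across data sources noted in the Background.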

Conclusions

Our suggested multi-modal sparse denoising autoencoder approach allows for an effective and interpretable integration of multi-omics data on pathway level while addressing the high dimensional character of omics data. Patient specific pathway score profiles derived from our model allow for a robust identification of disease subgroups.

]]>
<![CDATA[Robust high-throughput assays to assess discrete steps in ubiquitination and related cascades]]> https://www.researchpad.co/article/N311636ff-1e73-46f9-ab8b-95d07dea5c74

Background

Ubiquitination and ubiquitin-like protein post-translational modifications play an enormous number of roles in cellular processes. These modifications proceed through multistep reaction cascades. Readily implementable and robust methods to evaluate each step of the overall process are presently limited, yet they are critical to understanding and modulating the reaction sequence at any desired level, both in basic research and in therapeutic drug discovery and development.

Results

We developed multiple robust and reliable high-throughput assays to interrogate each of the sequential discrete steps in the reaction cascade leading to protein ubiquitination. As models for the E1 ubiquitin-activating enzyme, the E2 ubiquitin-conjugating enzyme, the E3 ubiquitin ligase, and their ultimate substrate of ubiquitination in a cascade, we examined Uba1, Rad6, Rad18, and proliferating cell nuclear antigen (PCNA), respectively, in reconstituted systems. Identification of inhibitors of this pathway holds promise in cancer therapy since PCNA ubiquitination plays a central role in DNA damage tolerance and resulting mutagenesis. The luminescence-based assays we developed allow for the quantitative determination of the degree of formation of ubiquitin thioester conjugate intermediates with both E1 and E2 proteins, autoubiquitination of the E3 protein involved, and ubiquitination of the final substrate. Thus, all covalent adducts along the cascade can be individually probed. We tested previously identified inhibitors of this ubiquitination cascade, finding generally good correspondence between compound potency trends determined by more traditional low-throughput methods and the present high-throughput ones.

Conclusions

These approaches are readily adaptable to other E1, E2, and E3 systems, and their substrates in both ubiquitination and ubiquitin-like post-translational modification cascades.

]]>
<![CDATA[Functional comparison of PBMCs isolated by Cell Preparation Tubes (CPT) vs. Lymphoprep Tubes]]> https://www.researchpad.co/article/N62d6c43d-f8a6-42bc-b64d-132202361aec

Background

Cryopreserved human peripheral blood mononuclear cells (PBMCs) are a commonly used sample type for a variety of immunological assays. Many factors can affect the quality of PBMCs, and careful consideration and validation of an appropriate PBMC isolation and cryopreservation method are important for well-designed clinical studies. A major point of divergence among PBMC isolation protocols is whether blood is collected directly into vacutainers pre-filled with density gradient medium or into conical tubes containing a porous barrier that separates the density gradient medium from the blood. To address potential differences in sample outcome, we isolated, cryopreserved, and compared PBMCs using parallel protocols differing only in which of these two common tube types was used for isolation.

Methods

Whole blood was processed in parallel using both Cell Preparation Tubes™ (CPT, BD Biosciences) and Lymphoprep™ Tubes (Axis-Shield) and assessed for yield and viability prior to cryopreservation. After thawing, samples were further examined by flow cytometry for cell yield, cell viability, frequency of 10 cell subsets, and capacity for stimulation-dependent CD4+ and CD8+ T cell intracellular cytokine production.

Results

No significant differences between PBMC samples isolated using CPT or Lymphoprep tubes were identified in cell recovery, viability, frequency of immune cell subsets, or T cell functionality.

Conclusion

CPT and Lymphoprep tubes are effective and comparable methods for PBMC isolation for immunological studies.

]]>
<![CDATA[Comparison of pathway and gene-level models for cancer prognosis prediction]]> https://www.researchpad.co/article/Na0ec71aa-99e3-4b2d-be2d-06649e3d2bb4

Background

Cancer prognosis prediction is valuable for patients and clinicians because it allows them to appropriately manage care. A promising direction for improving the performance and interpretation of expression-based predictive models involves the aggregation of gene-level data into biological pathways. While many studies have used pathway-level predictors for cancer survival analysis, a comprehensive comparison of pathway-level and gene-level prognostic models has not been performed. To address this gap, we characterized the performance of penalized Cox proportional hazard models built using either pathway- or gene-level predictors for the cancers profiled in The Cancer Genome Atlas (TCGA) and pathways from the Molecular Signatures Database (MSigDB).

Results

When analyzing TCGA data, we found that pathway-level models are more parsimonious, more robust, more computationally efficient and easier to interpret than gene-level models with similar predictive performance. For example, both pathway-level and gene-level models have an average Cox concordance index of ~ 0.85 for the TCGA glioma cohort; however, the gene-level model has twice as many predictors on average, its predictor composition is less stable across cross-validation folds, and estimation takes 40 times as long as for the pathway-level model. When the complex correlation structure of the data is broken by permutation, the pathway-level model has greater predictive performance while still retaining superior interpretability, robustness, parsimony and computational efficiency relative to the gene-level models. For example, using survival times simulated from uncorrelated gene expression data for the TCGA glioma cohort, the average concordance index of the pathway-level model increases to 0.88 while that of the gene-level model falls to 0.56.
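
The concordance index quoted above measures the fraction of comparable patient pairs that a model's risk scores order correctly. A minimal sketch of Harrell's C, illustrative rather than the study's evaluation code:

```python
def concordance_index(times, events, risk):
    """Harrell's C: the fraction of comparable pairs in which the patient
    with the higher risk score has the shorter survival time. A pair (i, j)
    is comparable when the earlier of the two times is an observed event
    (not a censored observation); tied risk scores count as half."""
    concordant = tied = comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # i must have the earlier, observed event time
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    tied += 1
    return (concordant + 0.5 * tied) / comparable
```

A perfect risk ordering yields C = 1.0, a reversed ordering 0.0, and uninformative (all-tied) scores 0.5, which is why the drop from ~ 0.85 to 0.56 under permutation indicates near-random gene-level predictions.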

Conclusion

The results of this study show that when the correlations among gene expression values are low, pathway-level analyses can yield better predictive performance, greater interpretative power, more robust models and lower computational cost than a gene-level model. When correlations among genes are high, a pathway-level analysis provides predictive power equivalent to that of a gene-level analysis while retaining the advantages of interpretability, robustness and computational efficiency.

]]>
<![CDATA[A completeness-independent method for pre-selection of closely related genomes for species delineation in prokaryotes]]> https://www.researchpad.co/article/N9e83c2e0-c778-4eb8-8021-c00fe163ac18

Background

Whole-genome approaches are widely preferred for species delineation in prokaryotes. However, these methods require pairwise alignments and calculations at the whole-genome level and thus are computationally intensive. To address this problem, a strategy consisting of sieving (pre-selecting closely related genomes) followed by alignment and calculation has been proposed.

Results

Here, we initially test a published approach called “genome-wide tetranucleotide frequency correlation coefficient” (TETRA), which is specially tailored for sieving. Our results show that sieving by TETRA requires > 40% completeness for both genomes of a pair to yield > 95% sensitivity, indicating that TETRA is completeness-dependent. Accordingly, we develop a novel algorithm called “fragment tetranucleotide frequency correlation coefficient” (FRAGTE), which uses fragments rather than whole genomes for sieving. Our results show that FRAGTE achieves ~ 100% sensitivity and high specificity on simulated genomes, real genomes and metagenome-assembled genomes, demonstrating that FRAGTE is completeness-independent. Additionally, FRAGTE sieves a smaller number of genomes for subsequent alignment and calculation, greatly improving the computational efficiency of the downstream steps, and it also reduces the computational cost of the sieving process itself. Consequently, FRAGTE markedly improves the runtime of both sieving and the subsequent alignment and calculation, together accelerating genome-wide species delineation.
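
Both TETRA and FRAGTE rest on comparing tetranucleotide frequency profiles between sequences. The sketch below shows a simplified version of the underlying statistic, using raw 4-mer frequencies and Pearson correlation without the z-score normalization of the published TETRA method; FRAGTE would apply such a comparison to genome fragments rather than whole genomes.

```python
from itertools import product
import numpy as np

# Enumerate all 256 tetranucleotides over the DNA alphabet.
TETRAS = [''.join(t) for t in product('ACGT', repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAS)}

def tetra_freqs(seq):
    """Count the overlapping 4-mers in seq and return relative frequencies."""
    counts = np.zeros(256)
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])
        if idx is not None:  # skip windows containing ambiguous bases
            counts[idx] += 1
    return counts / max(counts.sum(), 1)

def tetra_correlation(seq_a, seq_b):
    """Pearson correlation between the two tetranucleotide frequency vectors."""
    return float(np.corrcoef(tetra_freqs(seq_a), tetra_freqs(seq_b))[0, 1])
```

Sequences from the same species yield correlations near 1, while compositionally dissimilar sequences score much lower, which is the basis for using a correlation threshold to sieve candidate genome pairs.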

Conclusions

FRAGTE is a completeness-independent algorithm for sieving. Owing to its high sensitivity, high specificity, greatly reduced number of sieved genomes and greatly improved runtime, FRAGTE will be helpful for whole-genome approaches and will facilitate taxonomic studies in prokaryotes.

]]>
<![CDATA[Identifying glycan motifs using a novel subtree mining approach]]> https://www.researchpad.co/article/Ncc54cf74-b998-4cc1-a657-8e3f24c72c40

Background

Glycans are complex sugar chains, crucial to many biological processes. By participating in binding interactions with proteins, glycans often play key roles in host–pathogen interactions. The specificities of glycan-binding proteins, such as lectins and antibodies, are governed by motifs within larger glycan structures, and improved characterisations of these determinants would aid research into human diseases. Identification of motifs has previously been approached as a frequent subtree mining problem, and we extend these approaches with a glycan notation that allows recognition of terminal motifs.

Results

In this work, we customised a frequent subtree mining approach by altering the glycan notation to include information on terminal connections. This allows specific identification of terminal residues as potential motifs, better capturing the complexity of glycan-binding interactions. We achieved this by including additional nodes in a graph representation of the glycan structure to indicate the presence or absence of a linkage at particular backbone carbon positions. Combining this frequent subtree mining approach with a state-of-the-art feature selection algorithm termed minimum-redundancy, maximum-relevance (mRMR), we have generated a classification pipeline that is trained on data from a glycan microarray. When applied to a set of commonly used lectins, the identified motifs were consistent with known binding determinants. Furthermore, logistic regression classifiers trained using these motifs performed well across most lectins examined, with a median AUC value of 0.89.
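
The mRMR step described above can be viewed as a greedy search that balances each candidate motif's relevance to the binding outcome against its redundancy with already-selected motifs. A minimal illustration follows, using absolute Pearson correlation as a simple stand-in for the mutual-information terms of the published mRMR criterion; this is not the CCARL implementation, and all names are illustrative.

```python
import numpy as np

def mrmr_select(X, y, k):
    """Greedy mRMR over a sample-by-feature matrix X and labels y:
    start with the single most relevant feature, then repeatedly add the
    feature with the highest relevance-minus-redundancy score, where
    redundancy is the mean absolute correlation with features chosen so far."""
    n_features = X.shape[1]
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean(
                [abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected]
            )
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

Because the redundancy term penalizes features that merely duplicate already-selected ones, a duplicated motif feature is skipped in favor of a less relevant but complementary one, which is the behavior that makes mRMR useful for pruning overlapping glycan subtrees.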

Conclusions

We present here a new subtree mining approach for the classification of glycan binding and identification of potential binding motifs. The Carbohydrate Classification Accounting for Restricted Linkages (CCARL) method will assist in the interpretation of glycan microarray experiments and will aid in the discovery of novel binding motifs for further experimental characterisation.

]]>