ResearchPad - computational-techniques https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks <![CDATA[Automatic three-dimensional reconstruction of fascicles in peripheral nerves from histological images]]> https://www.researchpad.co/article/elastic_article_14591 Computational studies can be used to support the development of peripheral nerve interfaces, but currently use simplified models of nerve anatomy, which may impact the applicability of simulation results. To better quantify and model neural anatomy across the population, we have developed an algorithm to automatically reconstruct accurate peripheral nerve models from histological cross-sections. We acquired serial median nerve cross-sections from human cadaveric samples, staining one set with hematoxylin and eosin (H&E) and the other using immunohistochemistry (IHC) with anti-neurofilament antibody. We developed a four-step processing pipeline involving registration, fascicle detection, segmentation, and reconstruction. We compared the output of each step to manual ground truths, and additionally compared the final models to commonly used extrusions, via intersection-over-union (IOU). Fascicle detection and segmentation required the use of a neural network and active contours in H&E-stained images, but only simple image processing methods for IHC-stained images. Reconstruction achieved an IOU of 0.42±0.07 for H&E and 0.37±0.16 for IHC images, with errors partially attributable to global misalignment at the registration step, rather than poor reconstruction. This work provides a quantitative baseline for fully automatic construction of peripheral nerve models. Our models provided fascicular shape and branching information that would be lost via extrusion.
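
The intersection-over-union metric used above to score each pipeline step can be illustrated with a minimal sketch. It assumes both the automatic result and the manual ground truth are available as binary masks of identical shape; the function name `iou` and the toy masks are hypothetical and are not taken from the authors' code.

```python
def iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks given as lists of rows of 0/1."""
    same_shape = len(mask_a) == len(mask_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(mask_a, mask_b)
    )
    if not same_shape:
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1

    # By convention, two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else intersection / union


# Toy example: an automatic segmentation shifted one pixel from the manual one.
predicted = [[0, 1, 1, 0],
             [0, 1, 1, 0],
             [0, 0, 0, 0]]
manual = [[0, 0, 1, 1],
          [0, 0, 1, 1],
          [0, 0, 0, 0]]
print(round(iou(predicted, manual), 3))
```

A one-pixel shift of a two-pixel-wide region already drops the score to one third, which is consistent with the abstract's remark that global misalignment can depress IOU even when the reconstruction itself is reasonable.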

]]>
<![CDATA[RNAmountAlign: Efficient software for local, global, semiglobal pairwise and multiple RNA sequence/structure alignment]]> https://www.researchpad.co/article/N67fc2065-7e6a-4783-aab9-eb74d3ac0a95

Alignment of structural RNAs is an important problem with a wide range of applications. Since function is often determined by molecular structure, RNA alignment programs should take into account both sequence and base-pairing information for structural homology identification. This paper describes C++ software, RNAmountAlign, for RNA sequence/structure alignment that runs in O(n³) time and O(n²) space for two sequences of length n; moreover, our software returns a p-value (transformable to expect value E) based on Karlin-Altschul statistics for local alignment, as well as parameter fitting for local and global alignment. Using incremental mountain height, a representation of structural information computable in cubic time, RNAmountAlign implements quadratic time pairwise local, global and global/semiglobal (query search) alignment using a weighted combination of sequence and structural similarity. RNAmountAlign is capable of performing progressive multiple alignment as well. Benchmarking of RNAmountAlign against LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL, MXSCARNA, and MUSCLE shows that RNAmountAlign has reasonably good accuracy and faster run time supporting all alignment types. Additionally, our extension of RNAmountAlign, called RNAmountAlignScan, which scans a target genome sequence to find hits having high sequence and structural similarity to a given query sequence, outperforms RSEARCH and sequence-only query scans and runs faster than FOLDALIGN query scan.
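
The idea of scoring an alignment by a weighted combination of sequence and structural similarity can be sketched as a quadratic-time dynamic program. This is an illustration only, not the RNAmountAlign implementation: it assumes a single dot-bracket structure per sequence (the abstract does not spell out how incremental mountain height is computed), a linear gap penalty, and hypothetical names and defaults (`mountain_increments`, `global_alignment_score`, `gamma`, `match`, `mismatch`, `gap`).

```python
def mountain_increments(dot_bracket):
    """Per-position change in mountain height: +1 for '(', -1 for ')', 0 for '.'."""
    step = {"(": 1, ")": -1, ".": 0}
    return [step[symbol] for symbol in dot_bracket]


def global_alignment_score(seq_a, inc_a, seq_b, inc_b,
                           gamma=0.5, match=1.0, mismatch=-1.0, gap=-2.0):
    """Needleman-Wunsch score with a combined sequence/structure pair score.

    gamma weights sequence similarity; (1 - gamma) weights structural
    similarity. All parameter values are illustrative assumptions.
    """
    n, m = len(seq_a), len(seq_b)
    # Only two rows are kept because this sketch returns the score alone.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            seq_sim = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # Increments lie in {-1, 0, 1}, so this value lies in [-1, 1].
            struct_sim = 1.0 - abs(inc_a[i - 1] - inc_b[j - 1])
            pair = gamma * seq_sim + (1.0 - gamma) * struct_sim
            curr[j] = max(prev[j - 1] + pair,   # align position i with j
                          prev[j] + gap,        # gap in seq_b
                          curr[j - 1] + gap)    # gap in seq_a
        prev = curr
    return prev[m]


hairpin = "GGGAAACCC"
structure = "(((...)))"
increments = mountain_increments(structure)
print(global_alignment_score(hairpin, increments, hairpin, increments))
```

Once the structural profile is available for each sequence, the alignment itself fills an n by m table, which is where the quadratic running time quoted in the abstract comes from.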

]]>
<![CDATA[Disease-relevant mutations alter amino acid co-evolution networks in the second nucleotide binding domain of CFTR]]> https://www.researchpad.co/article/N211c75a7-eaac-4644-b655-cac4e239c2e4

Cystic fibrosis (CF) is an inherited disease caused by mutations in the cystic fibrosis transmembrane conductance regulator (CFTR) ion channel. Mutations in CFTR impair chloride ion transport in the epithelial tissues of patients, leading to cardiopulmonary decline and pancreatic insufficiency in the most severely affected patients. CFTR is composed of twelve membrane-spanning domains, two nucleotide-binding domains (NBDs), and a regulatory domain. The most common mutation in CFTR is a deletion of phenylalanine at position 508 (ΔF508) in NBD1. Previous research has primarily concentrated on the structure and dynamics of the NBD1 domain; however, numerous pathological mutations have also been found in the lesser-studied NBD2 domain. We have investigated the amino acid co-evolved network of interactions in NBD2, and the changes that occur in that network upon the introduction of CF and CF-related mutations (S1251N(T), S1235R, D1270N, N1303K(T)). Extensive coupling between the α- and β-subdomains was identified, involving residues in or near the Walker A, Walker B, H-loop and C-loop motifs. Alterations in the predicted residue network varied from moderate for the S1251T perturbation to more severe for N1303T. The S1235R and D1270N networks varied greatly compared to the wildtype, but these CF mutations only affect ion transport preference and do not severely disrupt CFTR function, suggesting dynamic flexibility in the network of interactions in NBD2. Our results also suggest that inappropriate interactions between the β-subdomain and Q-loop could be detrimental. We also identified mutations predicted to stabilize the NBD2 residue network upon introduction of the CF and CF-related mutations, and these predicted mutations are scored as benign by the MUTPRED2 algorithm. Our results suggest that the level of disruption of the co-evolution predictions of the amino acid networks in NBD2 does not have a straightforward correlation with the severity of the CF phenotypes observed.

]]>
<![CDATA[PhyloPi: An affordable, purpose built phylogenetic pipeline for the HIV drug resistance testing facility]]> https://www.researchpad.co/article/5c8823b3d5eed0c484638e7d

Introduction

Phylogenetic analysis plays a crucial role in quality control in the HIV drug resistance testing laboratory. If previous patient sequence data is available, sample swaps can be detected and investigated. As antiretroviral treatment coverage is increasing in many developing countries, so is the need for HIV drug resistance testing. In countries with multiple languages, transcription errors are easily made with patient identifiers. Here a self-contained, blastn-integrated phylogenetic pipeline can be especially useful. Even though our pipeline can run on any Unix-based system, a Raspberry Pi 3 is used here as a very affordable and integrated solution.

Performance benchmarks

The computational capability of this single board computer is demonstrated as well as the utility thereof in the HIV drug resistance laboratory. Benchmarking analysis against a large public database shows excellent time performance with minimal user intervention. This pipeline also contains utilities to find previous sequences as well as phylogenetic analysis and a graphical sequence mapping utility against the pol area of the HIV HXB2 reference genome. Sequence data from the Los Alamos HIV database was analyzed for inter- and intra-patient diversity and logistic regression was conducted on the calculated genetic distances. These findings show that allowable clustering and genetic distance between viral sequences from different patients is very dependent on subtype as well as the area of the viral genome being analyzed.
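
The genetic distances mentioned above can be illustrated with a minimal sketch. The abstract does not say which distance model was used, so the uncorrected p-distance (the proportion of differing sites between two aligned sequences) is swapped in here as the simplest stand-in; the function name `p_distance` and the toy sequences are hypothetical.

```python
def p_distance(aligned_a, aligned_b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Columns in which either sequence has a gap are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    differing = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differing += 1

    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


# Toy example: a previous and a current sequence labelled as the same patient.
previous_sample = "ACGTACGT"
current_sample = "ACGTTCGT"
print(p_distance(previous_sample, current_sample))  # 0.125
```

No cut-off is hard-coded in this sketch because, as the abstract notes, the allowable distance between sequences depends strongly on subtype and on the region of the viral genome being analyzed.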

Availability

The Raspberry Pi image for PhyloPi, source code of the pipeline, sequence data, bash-, python- and R-scripts for the logistic regression, benchmarking as well as helper scripts are available at http://scholar.ufs.ac.za:8080/xmlui/handle/11660/7638 and https://github.com/ArmandBester/phylopi. The PhyloPi image and the source code are published under the GPLv3 license. A demo version of the PhyloPi pipeline is available at http://phylopi.hpc.ufs.ac.za/.

]]>
<![CDATA[16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses]]> https://www.researchpad.co/article/5c7ee7c5d5eed0c4848f4d9c

Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure in situ. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding (“embedding”) each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed k-mers, obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data such as k-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. In addition, embeddings are versatile features that can be used for many downstream tasks, such as taxonomic and sample classification. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data, and clustering embeddings yielded high fidelity species clusters. Lastly, the k-mer embedding space captured distinct k-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites. 
Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks.
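
The two-stage idea of embedding k-mers and then pooling them into a sequence embedding can be sketched as follows. Skip-Gram training is not shown: `kmer_vectors` is a hypothetical lookup table standing in for trained word2vec output, and simple mean pooling is used in place of the sentence embedding technique, which the abstract does not name.

```python
def kmers(sequence, k):
    """All overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def embed_sequence(sequence, kmer_vectors, k):
    """Mean of the vectors of every k-mer in the sequence that has an embedding."""
    vectors = [kmer_vectors[km] for km in kmers(sequence, k) if km in kmer_vectors]
    if not vectors:
        raise ValueError("no k-mer of this sequence has an embedding")
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


# Toy two-dimensional vectors; real embeddings would be learned from reads.
kmer_vectors = {"ACG": [1.0, 0.0], "CGT": [0.0, 1.0]}
print(embed_sequence("ACGT", kmer_vectors, k=3))  # [0.5, 0.5]
```

The same pooling step could be applied to all reads from a sample or body site to obtain one vector per sample, which is the kind of fixed-length numeric feature a downstream classifier expects.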

]]>
<![CDATA[BioJava 5: A community driven open-source bioinformatics library]]> https://www.researchpad.co/article/5c6730bad5eed0c484f37fa8

BioJava is an open-source project that provides a Java library for processing biological data. The project aims to simplify bioinformatic analyses by implementing parsers, data structures, and algorithms for common tasks in genomics, structural biology, ontologies, phylogenetics, and more. Since 2012, we have released two major versions of the library (4 and 5) that include many new features to tackle challenges with increasingly complex macromolecular structure data. BioJava requires Java 8 or higher and is freely available under the LGPL 2.1 license. The project is hosted on GitHub at https://github.com/biojava/biojava. More information and documentation can be found online on the BioJava website (http://www.biojava.org) and tutorial (https://github.com/biojava/biojava-tutorial). All inquiries should be directed to the GitHub page or the BioJava mailing list (http://lists.open-bio.org/mailman/listinfo/biojava-l).

]]>
<![CDATA[Complex harmonic regularization with differential evolution in a memetic framework for biomarker selection]]> https://www.researchpad.co/article/5c6f1534d5eed0c48467aebf

In the study of cancer and genetic diseases, identifying highly correlated genes from high-dimensional data is an important problem. It is a great challenge to select relevant biomarkers from gene expression data that contains important correlation structures, in which some of the genes can be divided into different groups with a common biological function, chromosomal location or regulation. In this paper, we propose a penalized accelerated failure time model, CHR-DE, using a non-convex regularization (local search) with differential evolution (global search) in a wrapper-embedded memetic framework. The complex harmonic regularization (CHR) can approximate the combination of ℓp (1/2 ≤ p < 1) and ℓq (1 ≤ q < 2) for selecting biomarkers in groups. Differential evolution (DE) is utilized to globally optimize the CHR's hyperparameters, which gives CHR-DE a strong capability for selecting groups of genes in high-dimensional biological data. We also developed an efficient path seeking algorithm to optimize this penalized model. The proposed method is evaluated on synthetic data and three gene expression datasets: breast cancer, hepatocellular carcinoma and colorectal cancer. The experimental results demonstrate that CHR-DE is a more effective tool for feature selection and learning prediction.

]]>
<![CDATA[A combined computational strategy of sequence and structural analysis predicts the existence of a functional eicosanoid pathway in Drosophila melanogaster]]> https://www.researchpad.co/article/5c6c7583d5eed0c4843cfe40

This study reports on a putative eicosanoid biosynthesis pathway in Drosophila melanogaster and challenges the currently held view that mechanistic routes to synthesize eicosanoid or eicosanoid-like biolipids do not exist in insects, a view based on the fact that, to date, putative fly homologs of most mammalian enzymes have not been identified. Here we use systematic and comprehensive bioinformatics approaches to identify most of the mammalian eicosanoid synthesis enzymes. Sensitive sequence analysis techniques identified candidate Drosophila enzymes that share low global sequence identities with their human counterparts. Twenty Drosophila candidates were selected based upon (a) sequence identity with human enzymes of the cyclooxygenase and lipoxygenase branches, (b) similar domain architecture and structural conservation of the catalytic domain, and (c) presence of potentially equivalent functional residues. Evaluation of full-length structural models for these 20 top-scoring Drosophila candidates revealed a surprising degree of conservation in their overall folds and potential analogs for functional residues in all 20 enzymes. Although we were unable to identify any suitable candidate for lipoxygenase enzymes, we report structural homology models of three fly cyclooxygenases. Our findings predict that the D. melanogaster genome likely codes for one or more pathways for eicosanoid or eicosanoid-like biolipid synthesis. Our study suggests that classical and/or novel eicosanoid mediators must regulate biological functions in insects; these predictions can be tested with the power of Drosophila genetics. Such experimental analysis of eicosanoid biology in a simple model organism will have high relevance to human development and health.

]]>
<![CDATA[elPrep 4: A multithreaded framework for sequence analysis]]> https://www.researchpad.co/article/5c6dc9a8d5eed0c484529f91

We present elPrep 4, a reimplementation from scratch of the elPrep framework for processing sequence alignment map files in the Go programming language. elPrep 4 includes multiple new features allowing us to process all of the preparation steps defined by the GATK Best Practice pipelines for variant calling. This includes new and improved functionality for sorting, (optical) duplicate marking, base quality score recalibration, BED and VCF parsing, and various filtering options. The implementations of these options in elPrep 4 faithfully reproduce the outcomes of their counterparts in GATK 4, SAMtools, and Picard, even though the underlying algorithms are redesigned to take advantage of elPrep’s parallel execution framework to vastly improve the runtime and resource use compared to these tools. Our benchmarks show that elPrep executes the preparation steps of the GATK Best Practices up to 13x faster on WES data, and up to 7.4x faster for WGS data compared to running the same pipeline with GATK 4, while utilizing fewer compute resources.

]]>
<![CDATA[Evolution of the modular, disordered stress proteins known as dehydrins]]> https://www.researchpad.co/article/5c648d51d5eed0c484c8254c

Dehydrins, plant proteins that are upregulated during dehydration stress conditions, have modular sequences that can contain three conserved motifs (the Y-, S-, and K-segments). The presence and order of these motifs are used to classify dehydrins into one of five architectures: Kₙ, SKₙ, KₙS, YₙKₙ, and YₙSKₙ, where the subscript n describes the number of copies of that motif. In this study, an architectural and phylogenetic analysis was performed on 426 dehydrin sequences that were identified in 53 angiosperm and 3 gymnosperm genomes. It was found that angiosperms contained all five architectures, while gymnosperms only contained Kₙ and SKₙ dehydrins. This suggests that the ancestral dehydrin in spermatophytes was either Kₙ or SKₙ, and that the Y-segment-containing dehydrins first arose in angiosperms. A high-level split between the YₙSKₙ dehydrins and either the Kₙ or SKₙ dehydrins could not be confidently identified; however, two lower-level architectural divisions appear to have occurred after different duplication events. The first likely occurred after a whole genome duplication, resulting in the duplication of a Y₃SK₂ dehydrin; the duplicate subsequently lost an S- and a K-segment to become a Y₃K₁ dehydrin. The second split occurred after a tandem duplication of a Y₁SK₂ dehydrin, where the duplicate lost both the Y- and S-segments and gained four K-segments, resulting in a K₆ dehydrin. We suggest that the newly arisen Y₃K₁ dehydrin is possibly on its way to pseudogenization, while the newly arisen K₆ dehydrin developed a novel function in cold protection.
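
The architecture naming scheme described above can be made concrete with a short sketch. It assumes the conserved segments of a protein have already been detected and are given as an ordered string of the letters Y, S and K; motif detection itself is not shown, and the function name `architecture` is hypothetical. Labels are emitted with plain digits (for example Y3SK2).

```python
from itertools import groupby


def architecture(motifs):
    """Return (label, architecture class) for an ordered motif string.

    Example: 'YYYSKK' gives ('Y3SK2', 'YnSKn'). A single S-segment is
    written without a count, matching the labels used in the abstract.
    """
    # Collapse consecutive identical motifs into (motif, copy number) runs.
    runs = [(motif, len(list(group))) for motif, group in groupby(motifs)]

    label = "".join(
        motif if (motif == "S" and count == 1) else f"{motif}{count}"
        for motif, count in runs
    )

    # The order of distinct segments decides which of the five classes applies.
    order = "".join(motif for motif, _ in runs)
    classes = {"K": "Kn", "SK": "SKn", "KS": "KnS", "YK": "YnKn", "YSK": "YnSKn"}
    return label, classes.get(order, "unclassified")


for motif_string in ("YYYSKK", "YYYK", "YSKK", "KKKKKK"):
    print(motif_string, architecture(motif_string))
```

Run on the four motif strings above, the sketch reproduces the labels that appear in the two duplication events: Y3SK2 and its derivative Y3K1, and Y1SK2 and its derivative K6.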

]]>
<![CDATA[Designing and running an advanced Bioinformatics and genome analyses course in Tunisia]]> https://www.researchpad.co/article/5c58d660d5eed0c484031d37

Genome data, with underlying new knowledge, are accumulating at an exponential rate thanks to ever-improving sequencing technologies and the parallel development of dedicated, efficient Bioinformatics methods and tools. Advanced education in Bioinformatics and Genome Analyses is to a large extent not accessible to students in developing countries, where endeavors to set up Bioinformatics courses most often concern only basic levels. Here, we report a pioneering pilot experience concerning the design and implementation, from scratch, of a three-month advanced and extensive course in Bioinformatics and Genome Analyses at the Institut Pasteur de Tunis. Most significantly, the outcome of the course was upgrading the participants’ skills in Bioinformatics and Genome Analyses to recognized international standards. Here we detail the different steps involved in the implementation of this course as well as the topics covered in the program. The description of this pilot experience might be helpful for the implementation of other similar educational projects, notably in developing countries, that aim to go beyond the basics and provide young researchers with high-level skills.

]]>
<![CDATA[Molecular epidemiology of Blastocystis isolated from animals in the state of Rio de Janeiro, Brazil]]> https://www.researchpad.co/article/5c57e692d5eed0c484ef37b2

The enteric protist Blastocystis is one of the most frequently reported parasites infecting both humans and many other animal hosts worldwide. Remarkable genetic diversity has been observed in the species, with 17 different subtypes (STs) resolved on a molecular phylogeny based on small subunit ribosomal RNA genes (SSU rDNA). Nonetheless, information regarding its distribution, diversity and zoonotic potential remains scarce, especially in groups other than primates. In Brazil, only a few surveys, limited to human isolates, have so far been conducted on Blastocystis STs. The aim of this study is to determine the occurrence of Blastocystis subtypes in non-human vertebrate and invertebrate animal groups in different areas of the state of Rio de Janeiro, Brazil. A total of 334 stool samples were collected from animals representing 28 different genera. Blastocystis cultivated samples were subtyped using nuclear small subunit ribosomal DNA (SSU rDNA) sequencing. Phylogenetic analyses and BLAST searches revealed six subtypes: ST5 (28.8%), ST2 (21.1%), ST1 and ST8 (19.2% each), ST3 (7.7%) and ST4 (3.8%). Our findings indicate a considerable overlap between STs in humans and other animals. This highlights the importance of investigating a range of hosts for Blastocystis to understand the eco-epidemiological aspects of the parasite and its host specificity.

]]>
<![CDATA[Distribution of Scedosporium species in soil from areas with high human population density and tourist popularity in six geographic regions in Thailand]]> https://www.researchpad.co/article/5c521862d5eed0c484797eef

Scedosporium is a genus comprising at least 10 species of airborne fungi (saprobes) that survive and grow on decaying organic matter. These fungi are found in high density in human-affected areas such as sewage-contaminated water, and five species, namely Scedosporium apiospermum, S. boydii, S. aurantiacum, S. dehoogii, and S. minutisporum, cause human infections. Thailand is a popular travel destination worldwide, with many attractions located in densely populated areas; thus, large numbers of people may be exposed to pathogens present in these areas. We conducted a comprehensive survey of Scedosporium species in 350 soil samples obtained from 35 sites of high human population density and tourist popularity distributed over 23 provinces and six geographic regions of Thailand. Soil suspensions of each sample were inoculated on three plates of Scedo-Select III medium to isolate Scedosporium species. In total, 191 Scedosporium colonies were isolated from four provinces. The species were then identified using PCR and sequencing of the beta-tubulin (BT2) gene. Of the 191 isolates, 188 were S. apiospermum, one was S. dehoogii, and two could not be identified to species level. Genetic diversity analysis revealed high haplotype diversity of S. apiospermum. Soil is a major ecological niche for Scedosporium and may contain S. apiospermum populations with high genetic diversity. This study of Scedosporium distribution might encourage health care providers to consider Scedosporium infection in their patients.

]]>
<![CDATA[A computer-based incentivized food basket choice tool: Presentation and evaluation]]> https://www.researchpad.co/article/5c40f766d5eed0c48438606d

Objective

To develop and evaluate a low-cost computer-based tool to elicit dietary choices in an incentive compatible manner, which can be used on-line or as part of a laboratory study.

Methods

The study was conducted with around 255 adults. Respondents were asked to allocate a fixed monetary budget across a choice of around a hundred grocery items, with the prospect of receiving these items, with some probability, delivered to their home by a real supermarket. The tool covers a broad range of food items, allows inference of macro-nutrients and calories, and allows the researcher to fix the choice set participants can choose from. We compare the information derived from our incentivized tool with that from alternative low-cost ways of measuring dietary intake, namely the food frequency questionnaire and a one-shot version of the 24-hour dietary recall, which are both based on self-reports. We also compare the calorie intake indicators derived from each tool with a number of biometric measures for each subject, namely weight, body-mass-index (BMI) and waist size.

Results

The results show that the dietary information collected is only weakly correlated across the three methods. We find that only the calorie intake measure from our incentivized tool is positively and significantly related to each of the biometric indicators. Specifically, a 10% increase in calorie intake is associated with a 1.5% increase in BMI. By contrast, we find no significant correlations for either of the two measures based on self-reports.

Conclusion

The computer-based tool is a promising new, low-cost measure of dietary choices, particularly in one-shot situations where such behaviours are only observed once, whereas other tools like 24-hour dietary recalls and food frequency questionnaires may be more suited when they are administered repeatedly. The tool may be useful for research conducted with limited time and budget.

]]>
<![CDATA[Statistical investigations of protein residue direct couplings]]> https://www.researchpad.co/article/5c33c39fd5eed0c48459e46b

Protein Direct Coupling Analysis (DCA), which predicts residue-residue contacts based on covarying positions within a multiple sequence alignment, has been remarkably effective. This suggests that there is more to learn from sequence correlations than is generally assumed, and calls for deeper investigations into DCA and perhaps into other types of correlations. Here we describe an approach that enables such investigations by measuring, as an estimated p-value, the statistical significance of the association between residue-residue covariance and structural interactions, either internal or homodimeric. Its application to thirty protein superfamilies confirms that direct coupling (DC) scores correlate with 3D pairwise contacts with very high significance. This method also permits quantitative assessment of the relative performance of alternative DCA methods, and of the degree to which they detect direct versus indirect couplings. We illustrate its use to assess, for a given protein, the biological relevance of alternative conformational states, to investigate the possible mechanistic implications of differences between these states, and to characterize subtle aspects of direct couplings. Our analysis indicates that direct pairwise correlations may be largely distinct from correlated patterns associated with functional specialization, and that the joint analysis of both types of correlations can yield greater power. Data, programs, and source code are freely available at http://evaldca.igs.umaryland.edu.

]]>
<![CDATA[Full-Length Envelope Analyzer (FLEA): A tool for longitudinal analysis of viral amplicons]]> https://www.researchpad.co/article/5c1c0ab4d5eed0c484426918

Next generation sequencing of viral populations has advanced our understanding of viral population dynamics, the development of drug resistance, and escape from host immune responses. Many applications require complete gene sequences, which can be impossible to reconstruct from short reads. HIV env, the protein of interest for HIV vaccine studies, is exceptionally challenging for long-read sequencing and analysis due to its length, high substitution rate, and extensive indel variation. While long-read sequencing is attractive in this setting, the analysis of such data is not well handled by existing methods. To address this, we introduce FLEA (Full-Length Envelope Analyzer), which performs end-to-end analysis and visualization of long-read sequencing data. FLEA consists of both a pipeline (optionally run on a high-performance cluster), and a client-side web application that provides interactive results. The pipeline transforms FASTQ reads into high-quality consensus sequences (HQCSs) and uses them to build a codon-aware multiple sequence alignment. The resulting alignment is then used to infer phylogenies, selection pressure, and evolutionary dynamics. The web application provides publication-quality plots and interactive visualizations, including an annotated viral alignment browser, time series plots of evolutionary dynamics, visualizations of gene-wide selective pressures (such as dN/dS) across time and across protein structure, and a phylogenetic tree browser. We demonstrate how FLEA may be used to process Pacific Biosciences HIV env data and describe recent examples of its use. Simulations show how FLEA dramatically reduces the error rate of this sequencing platform, providing an accurate portrait of complex and variable HIV env populations. A public instance of FLEA is hosted at http://flea.datamonkey.org. The Python source code for the FLEA pipeline can be found at https://github.com/veg/flea-pipeline. 
The client-side application is available at https://github.com/veg/flea-web-app. A live demo of the P018 results can be found at http://flea.murrell.group/view/P018.

]]>
<![CDATA[Reduced RNA expression of the FMR1 gene in women with low (CGGn<26) repeats]]> https://www.researchpad.co/article/5c26976fd5eed0c48470f84f

Low FMR1 variants (CGGn<26) have been associated with premature ovarian aging, female infertility and poor IVF treatment success. To date, there has been little published information concerning possible molecular mechanisms for this effect. We wished to examine whether relative expression of RNA and the FMR1 gene’s fragile X mental retardation protein (FMRP) RNA isoforms differ in women with various FMR1 sub-genotypes (normal, low CGGn<26 and/or high CGGn≥34). This prospective cohort study was conducted between 2014 and 2017 in a clinical research unit of the Center for Human Reproduction in New York City. The study involved a total of 98 study subjects, including 18 young oocyte donors and 80 older infertility patients undergoing routine in vitro fertilization (IVF) cycles. The main outcome measure was RNA expression in human luteinized granulosa cells of 5 groups of FMRP isoforms. The relative expression of FMR1 RNA in human luteinized granulosa cells was measured by real-time PCR and a possible association with CGGn was explored. All 5 groups of FMRP RNA isoforms examined were found to be differentially expressed in human luteinized granulosa cells. The relative expression of four FMR1 RNA isoforms showed significant differences among 6 FMR1 sub-genotypes. Women with at least one low allele expressed significantly lower levels of all 5 sets of FMRP isoforms in comparison to the non-low group. While it would be of interest to see whether FMRP is also decreased in the low group, we recognize that in recent years it has been increasingly documented that information flow of genetics may be regulated by non-coding RNA, that is, without translation to a protein product. 
We, thus, conclude that various CGG expansions of FMR1 allele may lead to changes of RNA levels and ratios of distinct FMRP RNA isoforms, which could regulate the translation and/or cellular localization of FMRP, affect the expression of steroidogenic enzymes and hormonal receptors, or act in some other epigenetic process and therefore result in the ovarian dysfunction in infertility.

]]>
<![CDATA[Coevolving residues inform protein dynamics profiles and disease susceptibility of nSNVs]]> https://www.researchpad.co/article/5c09945dd5eed0c4842aeb26

The conformational dynamics of proteins are rarely used in methods that predict the impact of genetic mutations, owing to the paucity of three-dimensional protein structures compared with the vast number of available sequences. Until now, a three-dimensional (3D) structure has been required to predict the conformational dynamics of a protein. We introduce an approach that estimates the conformational dynamics of a protein without relying on structural information. This de novo approach utilizes coevolving residues identified from a multiple sequence alignment (MSA) using Potts models. These coevolving residues are used as contacts in a Gaussian network model (GNM) to obtain protein dynamics. B-factors calculated using sequence-based GNM (Seq-GNM) are in agreement with crystallographic B-factors as well as theoretical B-factors from the original GNM that utilizes the 3D structure. Moreover, we demonstrate the ability of the calculated B-factors from the Seq-GNM approach to discriminate genomic variants according to their phenotypes for a wide range of proteins. These results suggest that protein dynamics can be approximated based on sequence information alone, making it possible to assess the phenotypes of nSNVs in cases where a 3D structure is unknown. We hope this work will promote the use of dynamics information in genetic disease prediction at scale by circumventing the need for 3D structures.
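
The step of using coevolving residue pairs as contacts in a Gaussian network model can be sketched by building the Kirchhoff (connectivity) matrix that a GNM operates on. This is a minimal illustration under stated assumptions: the contact list is a hypothetical input standing in for Potts-model predictions, the function name `kirchhoff_matrix` is invented for the example, and whether sequence-adjacent residues are also linked is not stated in the abstract, so no such links are added here.

```python
def kirchhoff_matrix(n_residues, contacts):
    """Build the GNM Kirchhoff matrix from a list of 0-based residue pairs.

    Off-diagonal entries are -1 for residues in contact and 0 otherwise;
    each diagonal entry is the number of contacts of that residue.
    """
    gamma = [[0] * n_residues for _ in range(n_residues)]
    for i, j in contacts:
        if not (0 <= i < n_residues and 0 <= j < n_residues):
            raise ValueError(f"contact ({i}, {j}) is out of range")
        if i == j or gamma[i][j] != 0:
            continue  # ignore self-contacts and duplicate pairs
        gamma[i][j] = gamma[j][i] = -1
        gamma[i][i] += 1
        gamma[j][j] += 1
    return gamma


# Toy protein of four residues with four predicted coevolving pairs.
predicted_contacts = [(0, 1), (1, 2), (2, 3), (0, 2)]
for row in kirchhoff_matrix(4, predicted_contacts):
    print(row)
```

In a GNM, the predicted fluctuation of each residue, and hence its theoretical B-factor, is proportional to the corresponding diagonal element of the pseudo-inverse of this matrix. That inversion is left out of the sketch because it needs a linear algebra library.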

]]>
<![CDATA[De novo transcriptome assembly of the Chinese pearl barley, adlay, by full-length isoform and short-read RNA sequencing]]> https://www.researchpad.co/article/5c19669bd5eed0c484b525fb

Adlay (Coix lacryma-jobi) is a tropical grass that has long been used in traditional Chinese medicine and is known for its nutritional benefits. Recent studies have shown that vitamin E compounds in adlay protect against chronic diseases such as cancer and heart disease. However, the molecular basis of adlay's health benefits remains unknown. Here, we generated adlay gene sets by de novo transcriptome assembly using long-read isoform sequencing (Iso-Seq) and short-read RNA sequencing (RNA-Seq). The gene sets obtained from Iso-Seq and RNA-Seq contained 31,177 genes and 57,901 genes, respectively. We confirmed the validity of the assembled gene sets by experimentally analyzing the levels of prolamin and vitamin E biosynthesis-associated proteins in adlay plant tissues and seeds. We compared the screened adlay genes with known gene families from closely related plant species, such as rice, sorghum and maize. We also identified tissue-specific genes from the adlay leaf, root, and young and mature seed, and experimentally validated the differential expression of 12 randomly selected genes. Our study of the adlay transcriptome will provide a valuable resource for genetic studies that can enhance adlay breeding programs in the future.

]]>
<![CDATA[De novo protein structure prediction using ultra-fast molecular dynamics simulation]]> https://www.researchpad.co/article/5bfdb391d5eed0c4845ca84a

Modern genomic sequencing techniques have provided a massive number of protein sequences, but experimental determination of protein structures lags far behind this vast and unexplored sequence space. Consequently, computational biology is playing a more important role in protein structure prediction than ever. Here, we present a de novo prediction system, termed NiDelta, which builds on a deep convolutional neural network and a statistical potential that enables molecular dynamics simulation for modeling protein tertiary structure. Combined with evolution-based residue contacts, the presented predictor can predict the tertiary structures of a number of target proteins with remarkable accuracy. The proposed approach is demonstrated by calculations on a set of eighteen large proteins from different fold classes. The results show that the ultra-fast molecular dynamics simulation could dramatically reduce the gap between a sequence and its structure at the atomic level, and that it could also offer high efficiency in protein structure determination if sparse experimental data is available.

]]>