ResearchPad - decomposition https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks

<![CDATA[Evaluation of residue management practices on barley residue decomposition]]> https://www.researchpad.co/article/elastic_article_13875

Optimizing barley (Hordeum vulgare L.) production in Idaho and other parts of the Pacific Northwest (PNW) should focus on farm resource management. The effect of post-harvest residue management on barley residue decomposition has not been adequately studied. Thus, the objective of this study was to determine the effect of residue placement (surface vs. incorporated), residue size (chopped vs. ground-sieved) and soil type (sand and sandy loam) on barley residue decomposition. A 50-day (d) laboratory incubation experiment was conducted at a temperature of 25°C at the Aberdeen Research and Extension Center, Aberdeen, Idaho, USA. Following the study, a Markov-Chain Monte Carlo (MCMC) modeling approach was applied to investigate the first-order decay kinetics of barley residue. An accelerated initial flush of residue carbon (C) mineralization was measured for the sieved (Day 1) compared to chopped (Day 3 to 5) residues for both surface and incorporated applications. The highest evolution of carbon dioxide (CO2)-C of 8.3 g kg⁻¹ dry residue was observed on Day 1 from the incorporated-sieved application for both soils. The highest and lowest amounts of cumulative CO2-C released and percentage residue decomposed over 50 d were observed for surface-chopped (107 g kg⁻¹ dry residue and 27%, respectively) and incorporated-sieved (69 g kg⁻¹ dry residue and 18%, respectively) residues. There were no significant differences in C-mineralization from barley residue based on soil type or its interactions with residue placement and size (p > 0.05).
The largest decay constant k of 0.0083 d⁻¹ was calculated for surface-chopped residue, where the predicted half-life was 80 d, which did not differ from surface-sieved or incorporated-chopped residue. In contrast, incorporated-sieved treatments only resulted in a k of 0.0054 d⁻¹ and would need an additional 48 d to decompose 50% of the residue. Future residue decomposition studies under field conditions are warranted to verify the residue C-mineralization and its impact on residue management.
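
As a rough illustration of the first-order decay kinetics mentioned above, the remaining fraction of residue follows exp(-k·t), so the textbook half-life is ln(2)/k. The sketch below applies that closed form to the two decay constants quoted in the abstract. The function name `half_life_days` is hypothetical, and the closed form is a simplification: the half-lives reported in the abstract come from the MCMC model and need not match a calculation from the rounded k values exactly.

```python
import math


def half_life_days(k_per_day: float) -> float:
    """Half-life of first-order decay, t_half = ln(2) / k (k in d^-1)."""
    if k_per_day <= 0:
        raise ValueError("decay constant must be positive")
    return math.log(2) / k_per_day


# Decay constants quoted in the abstract (d^-1).
k_surface_chopped = 0.0083
k_incorporated_sieved = 0.0054

t_fast = half_life_days(k_surface_chopped)
t_slow = half_life_days(k_incorporated_sieved)

print(f"surface-chopped:     {t_fast:.1f} d")
print(f"incorporated-sieved: {t_slow:.1f} d")
print(f"difference:          {t_slow - t_fast:.1f} d")
```

With the rounded constants this gives roughly 84 d and 128 d, which is close to, but not identical with, the 80 d and 80 + 48 d figures stated in the abstract.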

]]>
<![CDATA[Impact of confinement in vehicle trunks on decomposition and entomological colonization of carcasses]]> https://www.researchpad.co/article/Nffbdbe54-85a9-48b9-9e05-57433aec6303

To investigate the impact of confinement in a car trunk on decomposition and insect colonization of carcasses, three freshly killed pig (Sus scrofa domesticus Erxleben) carcasses were placed individually in the trunks of older model cars and deployed in a forested area in the southwestern region of British Columbia, Canada, together with three freshly killed carcasses which were exposed in insect-accessible protective cages in the same forest. Decomposition rate and insect colonization of all carcasses were examined twice a week for four weeks. The exposed carcasses were colonized immediately by Calliphora latifrons Hough and Calliphora vomitoria (L.) followed by Lucilia illustris (Meigen), Phormia regina (Meigen) and Protophormia terraenovae (R.-D.) (Diptera: Calliphoridae). There was a delay of three to six days before the confined carcasses were colonized, first by P. regina, followed by Pr. terraenovae. These species represented the vast majority of blow fly species on the confined carcasses. Despite the delay in colonization, decomposition progressed much more rapidly in two of the confined carcasses in comparison with the exposed carcasses due to the greatly increased temperatures inside the vehicles, with the complete skeletonization of two of the confined carcasses occurring between nine and 13 days after death. One confined carcass was an anomaly, attracting far fewer insects, supporting fewer larval calliphorids and decomposing much more slowly than other carcasses, despite similarly increased temperatures. It was later discovered that the vehicle in which this carcass was confined had a solid metal fire wall between the passenger area and the trunk, which served to reduce insect access and release of odors. These data may be extremely valuable when analyzing cadavers found inside vehicle trunks.

]]>
<![CDATA[RNAmountAlign: Efficient software for local, global, semiglobal pairwise and multiple RNA sequence/structure alignment]]> https://www.researchpad.co/article/N67fc2065-7e6a-4783-aab9-eb74d3ac0a95

Alignment of structural RNAs is an important problem with a wide range of applications. Since function is often determined by molecular structure, RNA alignment programs should take into account both sequence and base-pairing information for structural homology identification. This paper describes C++ software, RNAmountAlign, for RNA sequence/structure alignment that runs in O(n³) time and O(n²) space for two sequences of length n; moreover, our software returns a p-value (transformable to expect value E) based on Karlin-Altschul statistics for local alignment, as well as parameter fitting for local and global alignment. Using incremental mountain height, a representation of structural information computable in cubic time, RNAmountAlign implements quadratic time pairwise local, global and global/semiglobal (query search) alignment using a weighted combination of sequence and structural similarity. RNAmountAlign is capable of performing progressive multiple alignment as well. Benchmarking of RNAmountAlign against LocARNA, LARA, FOLDALIGN, DYNALIGN, STRAL, MXSCARNA, and MUSCLE shows that RNAmountAlign has reasonably good accuracy and faster run time supporting all alignment types. Additionally, our extension of RNAmountAlign, called RNAmountAlignScan, which scans a target genome sequence to find hits having high sequence and structural similarity to a given query sequence, outperforms RSEARCH and sequence-only query scans and runs faster than FOLDALIGN query scan.
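
The remark that a p-value is "transformable to expect value E" can be made concrete with the standard Karlin-Altschul relation p = 1 - exp(-E). The sketch below shows that conversion in both directions. The function names are hypothetical and this is not code from RNAmountAlign; it only assumes the usual Poisson approximation for the number of chance hits.

```python
import math


def e_to_p(e_value: float) -> float:
    """p = 1 - exp(-E): probability of at least one chance hit."""
    if e_value < 0:
        raise ValueError("E-value must be non-negative")
    # expm1 keeps precision when E is very small.
    return -math.expm1(-e_value)


def p_to_e(p_value: float) -> float:
    """E = -ln(1 - p): inverse of the relation above."""
    if not 0 <= p_value < 1:
        raise ValueError("p-value must be in [0, 1)")
    # log1p keeps precision when p is very small.
    return -math.log1p(-p_value)


for e in (1e-6, 0.05, 1.0, 10.0):
    print(f"E = {e:g} -> p = {e_to_p(e):.6g}")
```

For small values the two quantities are nearly equal, which is why they are often used interchangeably when reporting highly significant hits.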

]]>
<![CDATA[Disease-relevant mutations alter amino acid co-evolution networks in the second nucleotide binding domain of CFTR]]> https://www.researchpad.co/article/N211c75a7-eaac-4644-b655-cac4e239c2e4

Cystic Fibrosis (CF) is an inherited disease caused by mutations in the cystic fibrosis transmembrane conductance regulator (CFTR) ion channel. Mutations in CFTR cause impaired chloride ion transport in the epithelial tissues of patients leading to cardiopulmonary decline and pancreatic insufficiency in the most severely affected patients. CFTR is composed of twelve membrane-spanning domains, two nucleotide-binding domains (NBDs), and a regulatory domain. The most common mutation in CFTR is a deletion of phenylalanine at position 508 (ΔF508) in NBD1. Previous research has primarily concentrated on the structure and dynamics of the NBD1 domain; however, numerous pathological mutations have also been found in the lesser-studied NBD2 domain. We have investigated the amino acid co-evolved network of interactions in NBD2, and the changes that occur in that network upon the introduction of CF and CF-related mutations (S1251N(T), S1235R, D1270N, N1303K(T)). Extensive coupling between the α- and β-subdomains was identified with residues in or near the Walker A, Walker B, H-loop and C-loop motifs. Alterations in the predicted residue network varied from moderate for the S1251T perturbation to more severe for N1303T. The S1235R and D1270N networks varied greatly compared to the wildtype, but these CF mutations only affect ion transport preference and do not severely disrupt CFTR function, suggesting dynamic flexibility in the network of interactions in NBD2. Our results also suggest that inappropriate interactions between the β-subdomain and Q-loop could be detrimental. We also identified mutations predicted to stabilize the NBD2 residue network upon introduction of the CF and CF-related mutations, and these predicted mutations are scored as benign by the MUTPRED2 algorithm. Our results suggest the level of disruption of the co-evolution predictions of the amino acid networks in NBD2 does not have a straightforward correlation with the severity of the CF phenotypes observed.

]]>
<![CDATA[PhyloPi: An affordable, purpose built phylogenetic pipeline for the HIV drug resistance testing facility]]> https://www.researchpad.co/article/5c8823b3d5eed0c484638e7d

Introduction

Phylogenetic analysis plays a crucial role in quality control in the HIV drug resistance testing laboratory. If previous patient sequence data are available, sample swaps can be detected and investigated. As antiretroviral treatment coverage is increasing in many developing countries, so is the need for HIV drug resistance testing. In countries with multiple languages, transcription errors are easily made with patient identifiers. Here a self-contained, blastn-integrated phylogenetic pipeline can be especially useful. Even though our pipeline can run on any Unix-based system, a Raspberry Pi 3 is used here as a very affordable and integrated solution.

Performance benchmarks

The computational capability of this single board computer is demonstrated as well as the utility thereof in the HIV drug resistance laboratory. Benchmarking analysis against a large public database shows excellent time performance with minimal user intervention. This pipeline also contains utilities to find previous sequences as well as phylogenetic analysis and a graphical sequence mapping utility against the pol area of the HIV HXB2 reference genome. Sequence data from the Los Alamos HIV database was analyzed for inter- and intra-patient diversity and logistic regression was conducted on the calculated genetic distances. These findings show that allowable clustering and genetic distance between viral sequences from different patients is very dependent on subtype as well as the area of the viral genome being analyzed.

Availability

The Raspberry Pi image for PhyloPi, source code of the pipeline, sequence data, bash-, python- and R-scripts for the logistic regression, benchmarking as well as helper scripts are available at http://scholar.ufs.ac.za:8080/xmlui/handle/11660/7638 and https://github.com/ArmandBester/phylopi. The PhyloPi image and the source code are published under the GPLv3 license. A demo version of the PhyloPi pipeline is available at http://phylopi.hpc.ufs.ac.za/.

]]>
<![CDATA[16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses]]> https://www.researchpad.co/article/5c7ee7c5d5eed0c4848f4d9c

Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure in situ. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding (“embedding”) each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed k-mers, obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data such as k-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. In addition, embeddings are versatile features that can be used for many downstream tasks, such as taxonomic and sample classification. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data, and clustering embeddings yielded high fidelity species clusters. Lastly, the k-mer embedding space captured distinct k-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites. 
Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks.
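
To make the k-mer embedding step more tangible, the following sketch shows how a nucleotide sequence might be split into overlapping k-mers and turned into (center, context) training pairs of the kind a Skip-Gram model consumes. The function names, the value of k, and the window size are illustrative assumptions and are not taken from the abstract; training the embedding itself is left out.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Split a sequence into overlapping k-mers (stride 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(tokens: list[str], window: int) -> list[tuple[str, str]]:
    """Pair every token with its neighbours within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical read fragment; k = 4 and window = 2 are arbitrary choices.
read = "ACGTACGGTCA"
tokens = kmers(read, 4)
print(tokens)
print(skipgram_pairs(tokens, 2)[:5])
```

In this framing each k-mer plays the role of a word and each read the role of a sentence, which is what lets word and sentence embedding methods be reused on sequencing data.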

]]>
<![CDATA[BioJava 5: A community driven open-source bioinformatics library]]> https://www.researchpad.co/article/5c6730bad5eed0c484f37fa8

BioJava is an open-source project that provides a Java library for processing biological data. The project aims to simplify bioinformatic analyses by implementing parsers, data structures, and algorithms for common tasks in genomics, structural biology, ontologies, phylogenetics, and more. Since 2012, we have released two major versions of the library (4 and 5) that include many new features to tackle challenges with increasingly complex macromolecular structure data. BioJava requires Java 8 or higher and is freely available under the LGPL 2.1 license. The project is hosted on GitHub at https://github.com/biojava/biojava. More information and documentation can be found online on the BioJava website (http://www.biojava.org) and tutorial (https://github.com/biojava/biojava-tutorial). All inquiries should be directed to the GitHub page or the BioJava mailing list (http://lists.open-bio.org/mailman/listinfo/biojava-l).

]]>
<![CDATA[Overcoming the problem of multicollinearity in sports performance data: A novel application of partial least squares correlation analysis]]> https://www.researchpad.co/article/5c6f1492d5eed0c48467a325

Objectives

Professional sporting organisations invest considerable resources collecting and analysing data in order to better understand the factors that influence performance. Recent advances in non-invasive technologies, such as global positioning systems (GPS), mean that large volumes of data are now readily available to coaches and sport scientists. However, analysing such data can be challenging, particularly when sample sizes are small and data sets contain multiple highly correlated variables, as is often the case in a sporting context. Multicollinearity in particular, if not treated appropriately, can be problematic and might lead to erroneous conclusions. In this paper we present a novel ‘leave one variable out’ (LOVO) partial least squares correlation analysis (PLSCA) methodology, designed to overcome the problem of multicollinearity, and show how this can be used to identify the training load (TL) variables that most influence ‘end fitness’ in young rugby league players.

Methods

The accumulated TL of sixteen male professional youth rugby league players (17.7 ± 0.9 years) was quantified via GPS, a micro-electrical-mechanical-system (MEMS), and players’ session-rating-of-perceived-exertion (sRPE) over a 6-week pre-season training period. Immediately prior to and following this training period, participants undertook a 30–15 intermittent fitness test (30-15IFT), which was used to determine a player’s ‘starting fitness’ and ‘end fitness’. In total twelve TL variables were collected, and these along with ‘starting fitness’ as a covariate were regressed against ‘end fitness’. However, considerable multicollinearity in the data (VIF >1000 for nine variables) meant that the multiple linear regression (MLR) process was unstable and so we developed a novel LOVO PLSCA adaptation to quantify the relative importance of the predictor variables and thus minimise multicollinearity issues. As such, the LOVO PLSCA was used as a tool to inform and refine the MLR process.
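
The variance inflation factor (VIF) quoted above is conventionally defined for predictor j as 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below covers only the simplest case of two predictors, where R² reduces to the squared Pearson correlation between them. The function names and the toy numbers are hypothetical; the study itself involved twelve TL variables, which would need a full multiple regression per variable.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1: list[float], x2: list[float]) -> float:
    """VIF = 1 / (1 - r^2); valid only when there are exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Made-up training-load columns that are strongly but not perfectly correlated.
total_distance = [5.1, 6.3, 7.0, 8.2, 9.1, 9.8]
high_speed_distance = [0.9, 1.3, 1.2, 1.8, 1.9, 2.3]
print(round(vif_two_predictors(total_distance, high_speed_distance), 2))
```

As the correlation approaches 1 the denominator approaches zero, which is how values above 1000 such as those reported here can arise.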

Results

The LOVO PLSCA identified the distance accumulated at very-high speed (>7 m·s⁻¹) as being the most important TL variable to influence improvement in player fitness, with this variable causing the largest decrease in singular value inertia (5.93). When included in a refined linear regression model, this variable, along with ‘starting fitness’ as a covariate, explained 73% of the variance in v30-15IFT ‘end fitness’ (p<0.001) and completely eliminated any multicollinearity issues.

Conclusions

The LOVO PLSCA technique appears to be a useful tool for evaluating the relative importance of predictor variables in data sets that exhibit considerable multicollinearity. When used as a filtering tool, LOVO PLSCA produced an MLR model that demonstrated a significant relationship between ‘end fitness’ and the predictor variable ‘accumulated distance at very-high speed’ when ‘starting fitness’ was included as a covariate. As such, LOVO PLSCA may be a useful tool for sport scientists and coaches seeking to analyse data sets obtained using GPS and MEMS technologies.

]]>
<![CDATA[A combined computational strategy of sequence and structural analysis predicts the existence of a functional eicosanoid pathway in Drosophila melanogaster]]> https://www.researchpad.co/article/5c6c7583d5eed0c4843cfe40

This study reports on a putative eicosanoid biosynthesis pathway in Drosophila melanogaster and challenges the currently held view that mechanistic routes to synthesize eicosanoid or eicosanoid-like biolipids do not exist in insects, since to date, putative fly homologs of most mammalian enzymes have not been identified. Here we use systematic and comprehensive bioinformatics approaches to identify Drosophila counterparts of most of the mammalian eicosanoid synthesis enzymes. Sensitive sequence analysis techniques identified candidate Drosophila enzymes that share low global sequence identities with their human counterparts. Twenty Drosophila candidates were selected based upon (a) sequence identity with human enzymes of the cyclooxygenase and lipoxygenase branches, (b) similar domain architecture and structural conservation of the catalytic domain, and (c) presence of potentially equivalent functional residues. Evaluation of full-length structural models for these 20 top-scoring Drosophila candidates revealed a surprising degree of conservation in their overall folds and potential analogs for functional residues in all 20 enzymes. Although we were unable to identify any suitable candidate for lipoxygenase enzymes, we report structural homology models of three fly cyclooxygenases. Our findings predict that the D. melanogaster genome likely codes for one or more pathways for eicosanoid or eicosanoid-like biolipid synthesis. Our study suggests that classical and/or novel eicosanoid mediators must regulate biological functions in insects, predictions that can be tested with the power of Drosophila genetics. Such experimental analysis of eicosanoid biology in a simple model organism will have high relevance to human development and health.

]]>
<![CDATA[Evolution of the modular, disordered stress proteins known as dehydrins]]> https://www.researchpad.co/article/5c648d51d5eed0c484c8254c

Dehydrins, plant proteins that are upregulated during dehydration stress conditions, have modular sequences that can contain three conserved motifs (the Y-, S-, and K-segments). The presence and order of these motifs are used to classify dehydrins into one of five architectures: Kn, SKn, KnS, YnKn, and YnSKn, where the subscript n describes the number of copies of that motif. In this study, an architectural and phylogenetic analysis was performed on 426 dehydrin sequences that were identified in 53 angiosperm and 3 gymnosperm genomes. It was found that angiosperms contained all five architectures, while gymnosperms only contained Kn and SKn dehydrins. This suggests that the ancestral dehydrin in spermatophytes was either Kn or SKn, and the Y-segment-containing dehydrins first arose in angiosperms. A high-level split of the YnSKn dehydrins from either the Kn or SKn dehydrins could not be confidently identified; however, two lower-level architectural divisions appear to have occurred after different duplication events. The first likely occurred after a whole genome duplication, resulting in the duplication of a Y3SK2 dehydrin; the duplicate subsequently lost an S- and K-segment to become a Y3K1 dehydrin. The second split occurred after a tandem duplication of a Y1SK2 dehydrin, where the duplicate lost both the Y- and S-segment and gained four K-segments, resulting in a K6 dehydrin. We suggest that the newly arisen Y3K1 dehydrin is possibly on its way to pseudogenization, while the newly arisen K6 dehydrin developed a novel function in cold protection.
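
The naming scheme described above (for example Y3SK2 or K6) amounts to a run-length encoding of the motif order. The sketch below derives such a label from a string of motif letters. It assumes the motifs have already been located and written in sequence order; the function name `architecture` is hypothetical, and the convention of never writing a count after S is inferred from the labels in the abstract rather than stated there.

```python
from itertools import groupby

VALID_MOTIFS = {"Y", "S", "K"}


def architecture(motif_order: str) -> str:
    """Collapse an ordered motif string such as 'YYYSKK' into 'Y3SK2'."""
    unknown = set(motif_order) - VALID_MOTIFS
    if unknown:
        raise ValueError(f"unexpected motif letters: {sorted(unknown)}")
    parts = []
    for motif, run in groupby(motif_order):
        count = len(list(run))
        # Assumption: S-segments are written without a copy number.
        parts.append(motif if motif == "S" else f"{motif}{count}")
    return "".join(parts)


# Motif orders corresponding to the dehydrins named in the abstract.
for order in ("YYYSKK", "YYYK", "YSKK", "KKKKKK"):
    print(order, "->", architecture(order))
```

Applied to the two duplication events in the abstract, losing an S- and a K-segment from YYYSKK gives YYYK, and replacing the Y- and S-segments of YSKK with four extra K-segments gives KKKKKK.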

]]>
<![CDATA[Designing and running an advanced Bioinformatics and genome analyses course in Tunisia]]> https://www.researchpad.co/article/5c58d660d5eed0c484031d37

Genome data, with underlying new knowledge, are accumulating at an exponential rate thanks to ever-improving sequencing technologies and the parallel development of dedicated efficient Bioinformatics methods and tools. Advanced Education in Bioinformatics and Genome Analyses is to a large extent not accessible to students in developing countries where endeavors to set up Bioinformatics courses concern most often only basic levels. Here, we report a pioneering pilot experience concerning the design and implementation, from scratch, of a three-month advanced and extensive course in Bioinformatics and Genome Analyses in the Institut Pasteur de Tunis. Most significantly, the outcome of the course was upgrading the participants’ skills in Bioinformatics and Genome Analyses to recognized international standards. Here we detail the different steps involved in the implementation of this course as well as the topics covered in the program. The description of this pilot experience might be helpful for the implementation of other similar educational projects, notably in developing countries, aiming to go beyond basics and provide young researchers with high-level skills.

]]>
<![CDATA[The wood decay fungus Cerrena unicolor adjusts its metabolism to grow on various types of wood and light conditions]]> https://www.researchpad.co/article/5c633975d5eed0c484ae67e0

Cerrena unicolor is a wood-degrading basidiomycete with ecological and biotechnological importance. Comprehensive Biolog-based analysis was performed to assess the metabolic capabilities and sensitivity to chemicals of C. unicolor FCL139 growing in various sawdust substrates and light conditions. The metabolic preferences of the fungus towards utilization of specific substrates were shown to be correlated with the sawdust medium applied for fungus growth and the light conditions. The highest catabolic activity of C. unicolor was observed after fungus precultivation on birch and ash sawdust media. The fungus growing in the dark showed the highest metabolic activity, which was indicated by its capacity to utilize a broad spectrum of compounds and the decomposition of 74/95 of the carbon sources. Under all light conditions, p-hydroxyphenylacetic acid was the most readily metabolized compound. The greatest tolerance to chemicals was also observed during C. unicolor growth in darkness. The fungus was the most sensitive to nitrogen compounds and antibiotics, but more resistant to chelators. Comparative analysis of C. unicolor and selected wood-decay fungi from different taxonomic and ecological groups revealed average catabolic activity of the fungus. However, C. unicolor showed outstanding capabilities to catabolize salicin and arbutin. The obtained picture of C. unicolor metabolism showed that the ability of the fungus to decompose woody plant material is influenced by various environmental factors.

]]>
<![CDATA[Early exposure to UV radiation overshadowed by precipitation and litter quality as drivers of decomposition in the northern Chihuahuan Desert]]> https://www.researchpad.co/article/5c61e929d5eed0c48496f8b3

Dryland ecosystems cover nearly 45% of the Earth’s land area and account for large proportions of terrestrial net primary production and carbon pools. However, predicting rates of plant litter decomposition in these vast ecosystems has proven challenging due to their distinctly dry and often hot climate regimes, and potentially unique physical drivers of decomposition. In this study, we elucidated the role of photopriming, i.e. exposure of standing dead leaf litter to solar radiation prior to litter drop that would chemically change litter and enhance biotic decay of fallen litter. We exposed litter substrates to three different UV radiation treatments simulating three months of UV radiation exposure in southern New Mexico: no light, UVA+UVB+Visible, and UVA+Visible. There were three litter types: mesquite leaflets (Prosopis glandulosa, litter with high nitrogen (N) concentration), filter paper (pure cellulose), and basswood (Tilia spp., high lignin concentration). We deployed the photoprimed litter in the field within a large-scale precipitation manipulation experiment: ∼50% precipitation reduction, ∼150% precipitation addition, and ambient control. Our results revealed the importance of litter substrate, particularly N content, for overall decomposition in drylands, as neither filter paper nor basswood exhibited measurable mass loss over the course of the year-long study, while high N-containing mesquite litter exhibited potential mass loss. We saw no effect of photopriming on subsequent microbial decay. We did observe a precipitation effect on mesquite, where the rate of decay was more rapid in ambient and precipitation addition treatments than in the drought treatment. Overall, we found that precipitation and N played a critical role in litter mass loss. In contrast, photopriming had no detected effects on mass loss over the course of our year-long study.
These results underpin the importance of biotic-driven decomposition, even in the presence of photopriming, for understanding litter decomposition and biogeochemical cycles in drylands.

]]>
<![CDATA[Integrating predicted transcriptome from multiple tissues improves association detection]]> https://www.researchpad.co/article/5c50c43bd5eed0c4845e8359

Integration of genome-wide association studies (GWAS) and expression quantitative trait loci (eQTL) studies is needed to improve our understanding of the biological mechanisms underlying GWAS hits, and our ability to identify therapeutic targets. Gene-level association methods such as PrediXcan can prioritize candidate targets. However, limited eQTL sample sizes and absence of relevant developmental and disease context restrict our ability to detect associations. Here we propose an efficient statistical method (MultiXcan) that leverages the substantial sharing of eQTLs across tissues and contexts to improve our ability to identify potential target genes. MultiXcan integrates evidence across multiple panels using multivariate regression, which naturally takes into account the correlation structure. We apply our method to simulated and real traits from the UK Biobank and show that, in realistic settings, we can detect a larger set of significantly associated genes than using each panel separately. To improve applicability, we developed a summary result-based extension called S-MultiXcan, which we show yields highly concordant results with the individual level version when LD is well matched. Our multivariate model-based approach allowed us to use the individual level results as a gold standard to calibrate and develop a robust implementation of the summary-based extension. Results from our analysis as well as software and necessary resources to apply our method are publicly available.

]]>
<![CDATA[Molecular epidemiology of Blastocystis isolated from animals in the state of Rio de Janeiro, Brazil]]> https://www.researchpad.co/article/5c57e692d5eed0c484ef37b2

The enteric protist Blastocystis is one of the most frequently reported parasites infecting both humans and many other animal hosts worldwide. A remarkable genetic diversity has been observed in the species, with 17 different subtypes (STs) on a molecular phylogeny based on small subunit ribosomal RNA genes (SSU rDNA). Nonetheless, information regarding its distribution, diversity and zoonotic potential remains scarce, especially in groups other than primates. In Brazil, only a few surveys limited to human isolates have so far been conducted on Blastocystis STs. The aim of this study is to determine the occurrence of Blastocystis subtypes in non-human vertebrate and invertebrate animal groups in different areas of the state of Rio de Janeiro, Brazil. A total of 334 stool samples were collected from animals representing 28 different genera. Blastocystis cultivated samples were subtyped using nuclear small subunit ribosomal DNA (SSU rDNA) sequencing. Phylogenetic analyses and BLAST searches revealed six subtypes: ST5 (28.8%), ST2 (21.1%), ST1 and ST8 (19.2%), ST3 (7.7%) and ST4 (3.8%). Our findings indicate a considerable overlap between STs in humans and other animals. This highlights the importance of investigating a range of hosts for Blastocystis to understand the eco-epidemiological aspects of the parasite and its host specificity.

]]>
<![CDATA[Deterministic column subset selection for single-cell RNA-Seq]]> https://www.researchpad.co/article/5c64493fd5eed0c484c2f93e

Analysis of single-cell RNA sequencing (scRNA-Seq) data often involves filtering out uninteresting or poorly measured genes and dimensionality reduction to reduce noise and simplify data visualization. However, techniques such as principal components analysis (PCA) fail to preserve non-negativity and sparsity structures present in the original matrices, and the coordinates of projected cells are not easily interpretable. Commonly used thresholding methods to filter genes avoid those pitfalls, but ignore collinearity and covariance in the original matrix. We show that a deterministic column subset selection (DCSS) method possesses many of the favorable properties of common thresholding methods and PCA, while avoiding pitfalls from both. We derive new spectral bounds for DCSS. We apply DCSS to two measures of gene expression from two scRNA-Seq experiments with different clustering workflows, and compare to three thresholding methods. In each case study, the clusters based on the small subset of the complete gene expression profile selected by DCSS are similar to clusters produced from the full set. The resulting clusters are informative for cell type.

]]>
<![CDATA[Distribution of Scedosporium species in soil from areas with high human population density and tourist popularity in six geographic regions in Thailand]]> https://www.researchpad.co/article/5c521862d5eed0c484797eef

Scedosporium is a genus comprising at least 10 species of airborne fungi (saprobes) that survive and grow on decaying organic matter. These fungi are found in high density in human-affected areas such as sewage-contaminated water, and five species, namely Scedosporium apiospermum, S. boydii, S. aurantiacum, S. dehoogii, and S. minutisporum, cause human infections. Thailand is a popular travel destination, with many attractions located in densely populated areas; thus, large numbers of people may be exposed to pathogens present in these areas. We conducted a comprehensive survey of Scedosporium species in 350 soil samples obtained from 35 sites of high human population density and tourist popularity distributed over 23 provinces and six geographic regions of Thailand. Soil suspensions of each sample were inoculated on three plates of Scedo-Select III medium to isolate Scedosporium species. In total, 191 Scedosporium colonies were isolated from four provinces. The species were then identified using PCR and sequencing of the beta-tubulin (BT2) gene. Of the 191 isolates, 188 were S. apiospermum, one was S. dehoogii, and two could not be identified to species level. Genetic diversity analysis revealed high haplotype diversity of S. apiospermum. Soil is a major ecological niche for Scedosporium and may contain S. apiospermum populations with high genetic diversity. This study of Scedosporium distribution might encourage health care providers to consider Scedosporium infection in their patients.

]]>
<![CDATA[Statistical investigations of protein residue direct couplings]]> https://www.researchpad.co/article/5c33c39fd5eed0c48459e46b

Protein Direct Coupling Analysis (DCA), which predicts residue-residue contacts based on covarying positions within a multiple sequence alignment, has been remarkably effective. This suggests that there is more to learn from sequence correlations than is generally assumed, and calls for deeper investigations into DCA and perhaps into other types of correlations. Here we describe an approach that enables such investigations by measuring, as an estimated p-value, the statistical significance of the association between residue-residue covariance and structural interactions, either internal or homodimeric. Its application to thirty protein superfamilies confirms that direct coupling (DC) scores correlate with 3D pairwise contacts with very high significance. This method also permits quantitative assessment of the relative performance of alternative DCA methods, and of the degree to which they detect direct versus indirect couplings. We illustrate its use to assess, for a given protein, the biological relevance of alternative conformational states, to investigate the possible mechanistic implications of differences between these states, and to characterize subtle aspects of direct couplings. Our analysis indicates that direct pairwise correlations may be largely distinct from correlated patterns associated with functional specialization, and that the joint analysis of both types of correlations can yield greater power. Data, programs, and source code are freely available at http://evaldca.igs.umaryland.edu.

]]>
<![CDATA[Two-dimensional local Fourier image reconstruction via domain decomposition Fourier continuation method]]> https://www.researchpad.co/article/5c3fa5aed5eed0c484ca744f

An MRI image is obtained in the spatial domain from given Fourier coefficients in the frequency domain. Obtaining a high-resolution image is costly because it requires higher-frequency Fourier data, whereas lower-frequency Fourier data is less costly and is effective if the image is smooth. However, Gibbs ringing, where present, dominates reconstructions from lower-frequency Fourier data. We propose an efficient and accurate local reconstruction method that uses lower-frequency Fourier data and yields a sharp image profile near local edges. The proposed method uses only a small number of image data in the local area and is therefore efficient. It is also accurate because it minimizes the global effects on the reconstruction near weak edges that appear in many global methods, which use all the image data for the reconstruction. To apply the Fourier method locally to local non-periodic data, the proposed method is based on the Fourier continuation method. This work extends our previous 1D Fourier domain decomposition method to 2D Fourier data. The proposed method first divides the MRI image in the spatial domain into many subdomains and applies the Fourier continuation method to obtain a smooth periodic extension of the subdomain of interest. It then reconstructs the local image by L2 minimization regularized by the L1 norm of edge sparsity to sharpen the image near edges. Our numerical results suggest that the proposed method should be applied in a dimension-by-dimension manner rather than in a global manner, for both reconstruction quality and computational efficiency. The numerical results show that the proposed method is effective when a local reconstruction is sought and that the solution is free of Gibbs oscillations.
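The Gibbs ringing mentioned above can be illustrated with a minimal 1D sketch. This is not the proposed reconstruction method; it only shows, for a textbook unit square wave, that the overshoot of a truncated Fourier series next to a jump does not shrink as more low-frequency terms are added. The function names `square_wave_partial_sum` and `peak_near_jump` are hypothetical and chosen for illustration.

```python
import math


def square_wave_partial_sum(x, n_terms):
    """Partial Fourier sum of a unit square wave (+1 on (0, pi), -1 on (-pi, 0))."""
    total = 0.0
    for j in range(n_terms):
        k = 2 * j + 1  # only odd harmonics contribute to a square wave
        total += math.sin(k * x) / k
    return 4.0 / math.pi * total


def peak_near_jump(n_terms, samples=4000):
    """Largest value of the partial sum on (0, pi/2], sampled on a uniform grid."""
    return max(
        square_wave_partial_sum(math.pi / 2 * (i + 1) / samples, n_terms)
        for i in range(samples)
    )


# The true signal never exceeds 1, yet the peak of the truncated series
# stays near 1.18 however many terms are kept; it only moves closer to the jump.
for n in (8, 32, 128):
    print(n, round(peak_near_jump(n), 4))
```

The persistent overshoot is roughly 9% of the jump height, which is why adding more global Fourier data alone does not remove the ringing near edges.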

]]>
<![CDATA[Full-Length Envelope Analyzer (FLEA): A tool for longitudinal analysis of viral amplicons]]> https://www.researchpad.co/article/5c1c0ab4d5eed0c484426918

Next generation sequencing of viral populations has advanced our understanding of viral population dynamics, the development of drug resistance, and escape from host immune responses. Many applications require complete gene sequences, which can be impossible to reconstruct from short reads. HIV env, the protein of interest for HIV vaccine studies, is exceptionally challenging for long-read sequencing and analysis due to its length, high substitution rate, and extensive indel variation. While long-read sequencing is attractive in this setting, the analysis of such data is not well handled by existing methods. To address this, we introduce FLEA (Full-Length Envelope Analyzer), which performs end-to-end analysis and visualization of long-read sequencing data. FLEA consists of both a pipeline (optionally run on a high-performance cluster), and a client-side web application that provides interactive results. The pipeline transforms FASTQ reads into high-quality consensus sequences (HQCSs) and uses them to build a codon-aware multiple sequence alignment. The resulting alignment is then used to infer phylogenies, selection pressure, and evolutionary dynamics. The web application provides publication-quality plots and interactive visualizations, including an annotated viral alignment browser, time series plots of evolutionary dynamics, visualizations of gene-wide selective pressures (such as dN/dS) across time and across protein structure, and a phylogenetic tree browser. We demonstrate how FLEA may be used to process Pacific Biosciences HIV env data and describe recent examples of its use. Simulations show how FLEA dramatically reduces the error rate of this sequencing platform, providing an accurate portrait of complex and variable HIV env populations. A public instance of FLEA is hosted at http://flea.datamonkey.org. The Python source code for the FLEA pipeline can be found at https://github.com/veg/flea-pipeline. 
The client-side application is available at https://github.com/veg/flea-web-app. A live demo of the P018 results can be found at http://flea.murrell.group/view/P018.

]]>