ResearchPad - Software https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks <![CDATA[Visualize omics data on networks with Omics Visualizer, a Cytoscape App]]> https://www.researchpad.co/product?articleinfo=N9ec7b981-1580-4341-97c9-91419916279f

Cytoscape is open-source software used to analyze and visualize biological networks. In addition to being able to import networks from a variety of sources, Cytoscape allows users to import tabular node data and visualize it on networks. Unfortunately, such data tables can only contain one row of data per node, whereas omics data often have multiple rows for the same gene or protein, representing different post-translational modification sites, peptides, splice isoforms, or conditions. Here, we present a new app, Omics Visualizer, that allows users to import data tables with several rows referring to the same node, connect them to one or more networks, and visualize the connected data on networks. Omics Visualizer uses the Cytoscape enhancedGraphics app to show the data either in the nodes (pie visualization) or around the nodes (donut visualization), where the colors of the slices represent the imported values. If the user does not provide a network, the app can retrieve one from the STRING database using the Cytoscape stringApp. The Omics Visualizer app is freely available at https://apps.cytoscape.org/apps/omicsvisualizer.
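The multi-row data model described above can be illustrated with a minimal Python sketch. This is not the app's implementation (Omics Visualizer runs inside Cytoscape); the column names `gene`, `site`, and `value` and both function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_rows_by_node(rows, key="gene"):
    """Group table rows by the node they refer to, keeping input order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return dict(grouped)


def slice_values(grouped, value_column="value"):
    """Return, for each node, one value per row (one pie or donut slice each)."""
    return {node: [row[value_column] for row in node_rows]
            for node, node_rows in grouped.items()}


# Hypothetical table: two phosphorylation sites for TP53, one for AKT1.
table = [
    {"gene": "TP53", "site": "S15", "value": 1.2},
    {"gene": "AKT1", "site": "S473", "value": -0.4},
    {"gene": "TP53", "site": "S20", "value": 0.7},
]
slices = slice_values(group_rows_by_node(table))
```

In this sketch a node with two rows would be drawn with two slices, which is the situation a one-row-per-node table cannot represent.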

]]>
<![CDATA[The UCSF Mouse Inventory Database Application, an Open Source Web App for Sharing Mutant Mice Within a Research Community]]> https://www.researchpad.co/product?articleinfo=Nbb3b2ed7-43fd-4a80-9469-797d6b2ba821

The UCSF Mouse Inventory Database Application is an open-source Web App that provides information about the mutant alleles, transgenes, and inbred strains maintained by investigators at the university and facilitates sharing of these resources within the university community. The Application is designed to promote collaboration, decrease the costs associated with obtaining genetically-modified mice, and increase access to mouse lines that are difficult to obtain. An inventory of the genetically-modified mice on campus and the investigators who maintain them is compiled from records of purchases from external sources, transfers from researchers within and outside the university, and from data provided by users. These data are verified and augmented with relevant information harvested from public databases, and stored in a succinct, searchable database secured on the university network. Here we describe this resource and provide information about how to implement and maintain such a mouse inventory database application at other institutions.

]]>
<![CDATA[ggroups: an R package for pedigree and genetic groups data]]> https://www.researchpad.co/product?articleinfo=N375fc0f1-ece6-4d23-9070-43e96f04a13e

Background

R is a multi-platform statistical software environment and an object-oriented programming language. The Comprehensive R Archive Network (CRAN) repository featured over 15,000 free, open-source packages at the time of writing (https://cran.r-project.org/web/packages, accessed in October 2019). This article introduces the package ggroups. The package provides functions for checking and processing pedigrees and for calculating the additive genetic relationship matrix and its inverse, which are used to study population structure and to predict the genetic merit of animals. Calculation of the dominance relationship matrix and its inverse is also covered. Genetic groups are a concept in animal breeding that accounts for unequal average genetic merit among groups of unknown parents. The package provides functions for calculating the matrix of genetic group contributions (Q). Calculating Q is computationally demanding and, depending on the size of the pedigree and the number of genetic groups, may not be feasible on personal computers. Therefore, a computationally optimised function and a parallel-processing alternative are provided in the package.
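To make the additive genetic relationship matrix concrete, here is a minimal Python sketch of the classical tabular method. It is not the ggroups implementation (ggroups is an R package, and its functions are not reproduced here). The function name and the input format, a list of `(animal, sire, dam)` tuples with parents listed before offspring and `0` for an unknown parent, are assumptions made for illustration.

```python
def additive_relationship(pedigree):
    """Build the additive relationship matrix A with the tabular method.

    pedigree: list of (animal, sire, dam) tuples, parents listed before
    their offspring; 0 marks an unknown parent (assumed convention).
    """
    n = len(pedigree)
    index = {animal: i for i, (animal, _, _) in enumerate(pedigree)}
    A = [[0.0] * n for _ in range(n)]
    for i, (_, sire, dam) in enumerate(pedigree):
        s = index.get(sire)  # None when the sire is unknown
        d = index.get(dam)   # None when the dam is unknown
        # Diagonal: 1 + inbreeding coefficient = 1 + 0.5 * a(sire, dam)
        both_known = s is not None and d is not None
        A[i][i] = 1.0 + (0.5 * A[s][d] if both_known else 0.0)
        # Off-diagonal: average of the relationships with the two parents
        for j in range(i):
            a_js = A[j][s] if s is not None else 0.0
            a_jd = A[j][d] if d is not None else 0.0
            A[i][j] = A[j][i] = 0.5 * (a_js + a_jd)
    return A


# Hypothetical pedigree: animals 3 and 4 are full sibs, and 5 is their offspring.
example = [(1, 0, 0), (2, 0, 0), (3, 1, 2), (4, 1, 2), (5, 3, 4)]
A = additive_relationship(example)
```

The double loop is quadratic in the number of animals, which gives a sense of why optimised and parallel routines matter for large pedigrees.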

Results

Using sample data, outputs from different functions of the package are presented to illustrate working with the package in practice.

Conclusions

The presented R package is a free and open source tool mainly for quantitative geneticists and ecologists, who deal with pedigree data. It provides numerous functions for handling pedigree data, and calculating various pedigree-based matrices. Some of the functions are computationally optimised for large-scale data.

]]>
<![CDATA[Blast2Fish: a reference-based annotation web tool for transcriptome analysis of non-model teleost fish]]> https://www.researchpad.co/product?articleinfo=Nd0ff4215-3994-4c2f-b465-292d15729fd5

Background

Transcriptome analysis by next-generation sequencing has become a popular technique in recent years. This approach is well suited to studies of non-model organisms, as de novo assembly is independent of prior genomic sequences of organisms. De novo sequencing has benefited many studies on commercially important fish species. However, to understand the functions of these assembled sequences, they still need to be annotated with existing sequence databases. By combining Basic Local Alignment Search Tool (BLAST) and Gene Ontology analysis, we were able to identify homologous sequences of assembled sequences and describe their characteristics using pre-defined tags for each gene. However, the conventional annotation results obtained in this way for non-model assembled sequences were still limited by a lack of pre-defined tags and by poorly documented records in the database.

Results

We introduced Blast2Fish, a novel approach for performing functional enrichment analysis on non-model teleost fish transcriptome data. The Blast2Fish pipeline was designed to be a reference-based enrichment method. Instead of annotating the BLAST single top hit by a pre-defined gene-to-tag database, we included 500 hits to search related PubMed articles and parse biological terms. These descriptive terms were then sorted and recorded as annotations for the query. The results showed that Blast2Fish was capable of providing meaningful annotations on immunology topics for non-model fish transcriptome analysis.

Conclusion

Blast2Fish provides a novel approach for annotating sequences of non-model fish. The reference-based strategy allows annotation to be performed without pre-defined tags for each gene. This method strongly benefits non-model teleost fish studies for gene functional enrichment analysis.

]]>
<![CDATA[MFsim—an open Java all-in-one rich-client simulation environment for mesoscopic simulation]]> https://www.researchpad.co/product?articleinfo=Nb8d84e57-cca0-4d45-804e-454f1ce9aabb

MFsim is an open Java all-in-one rich-client computing environment for mesoscopic simulation with Jdpd as its default simulation kernel for Molecular Fragment (Dissipative Particle) Dynamics. The new environment comprises the complete preparation-simulation-evaluation triad of a mesoscopic simulation task and especially enables biomolecular simulation tasks with peptides and proteins. Highlights are a SPICES molecular structure editor, a PDB-to-SPICES parser for particle-based peptide/protein representations, support for polymer definitions, a compartment editor for complex simulation box start configurations, and interactive and flexible simulation box views including analytics, simulation movie generation, and animated diagrams. As an open project, MFsim allows for customized extensions for different fields of research.

]]>
<![CDATA[Negative binomial additive model for RNA-Seq data analysis]]> https://www.researchpad.co/product?articleinfo=N14ca9b37-a8fc-4b5f-86a4-5069844e13da

Background

High-throughput sequencing experiments followed by differential expression analysis is a widely used approach for detecting genomic biomarkers. A fundamental step in differential expression analysis is to model the association between gene counts and covariates of interest. Existing models assume linear effect of covariates, which is restrictive and may not be sufficient for certain phenotypes.

Results

We introduce NBAMSeq, a flexible statistical model that is based on the generalized additive model and allows for information sharing across genes in variance estimation. Specifically, we model the logarithm of mean gene counts as sums of smooth functions, with the smoothing parameters and coefficients estimated simultaneously within a nested iterative method. The variance is estimated by a Bayesian shrinkage approach to fully exploit the information across all genes.
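The model described above can be written schematically as follows. The notation is an assumption introduced here for illustration: $K_{ij}$ is the count of gene $i$ in sample $j$, $\mu_{ij}$ its mean, $\alpha_i$ a gene-wise dispersion, $x_{jk}$ the $k$-th covariate of sample $j$, and $f_{ik}$ a smooth function. The intercept $\beta_{i0}$ is also an assumption, and any normalization offset is omitted.

```latex
K_{ij} \sim \mathrm{NB}(\mu_{ij}, \alpha_i), \qquad
\log \mu_{ij} = \beta_{i0} + \sum_{k=1}^{p} f_{ik}(x_{jk})
```

A model with linear covariate effects corresponds to the special case $f_{ik}(x) = \beta_{ik} x$, which is the restriction the additive formulation relaxes.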

Conclusions

Based on extensive simulations and case studies of RNA-Seq data, we show that NBAMSeq offers improved performance in detecting nonlinear effect and maintains equivalent performance in detecting linear effect compared to existing methods. The vignette and source code of NBAMSeq are available at http://bioconductor.org/packages/release/bioc/html/NBAMSeq.html.

]]>
<![CDATA[wg-blimp: an end-to-end analysis pipeline for whole genome bisulfite sequencing data]]> https://www.researchpad.co/product?articleinfo=N11c6685f-5d17-4d8e-85f6-86040f6dbd34

Background

Analysing whole genome bisulfite sequencing datasets is a data-intensive task that requires comprehensive and reproducible workflows to generate valid results. While many algorithms have been developed for tasks such as alignment, comprehensive end-to-end pipelines are still sparse. Furthermore, previous pipelines lack features or show technical deficiencies, thus impeding analyses.

Results

We developed wg-blimp (whole genome bisulfite sequencing methylation analysis pipeline) as an end-to-end pipeline to ease whole genome bisulfite sequencing data analysis. It integrates established algorithms for alignment, quality control, methylation calling, detection of differentially methylated regions, and methylome segmentation, requiring only a reference genome and raw sequencing data as input. Comparing wg-blimp to previous end-to-end pipelines reveals similar setups for common sequence processing tasks, but shows differences for post-alignment analyses. We improve on previous pipelines by providing a more comprehensive analysis workflow as well as an interactive user interface. To demonstrate wg-blimp’s ability to produce correct results we used it to call differentially methylated regions for two publicly available datasets. We were able to replicate 112 of 114 previously published regions, and found results to be consistent with previous findings. We further applied wg-blimp to a publicly available sample of embryonic stem cells to showcase methylome segmentation. As expected, unmethylated regions were in close proximity of transcription start sites. Segmentation results were consistent with previous analyses, despite different reference genomes and sequencing techniques.

Conclusions

wg-blimp provides a comprehensive analysis pipeline for whole genome bisulfite sequencing data as well as a user interface for simplified result inspection. We demonstrated its applicability by analysing multiple publicly available datasets. Thus, wg-blimp is a relevant alternative to previous analysis pipelines and may facilitate future epigenetic research.

]]>
<![CDATA[DiSCount: computer vision for automated quantification of Striga seed germination]]> https://www.researchpad.co/product?articleinfo=N945aecc8-eb16-4482-9157-7d58c8ea30c3

Background

Plant parasitic weeds belonging to the genus Striga are a major threat for food production in Sub-Saharan Africa and Southeast Asia. The parasite’s life cycle starts with the induction of seed germination by host plant-derived signals, followed by parasite attachment, infection, outgrowth, flowering, reproduction, seed set and dispersal. Given the small seed size of the parasite (< 200 μm), quantification of the impact of new control measures that interfere with seed germination relies on manual, labour-intensive counting of seed batches under the microscope. Hence, there is a need for high-throughput assays that allow for large-scale screening of compounds or microorganisms that adversely affect Striga seed germination.

Results

Here, we introduce DiSCount (Digital Striga Counter): a computer vision tool for automated quantification of total and germinated Striga seed numbers in standard glass fibre filter assays. We developed the software using a machine learning approach trained with a dataset of 98 manually annotated images. Then, we validated and tested the model against a total dataset of 188 manually counted images. The results showed that DiSCount has an average error of 3.38 percentage points per image compared to the manually counted dataset. Most importantly, DiSCount achieves a 100 to 3000-fold speed increase in image analysis when compared to manual analysis, with an inference time of approximately 3 s per image on a single CPU and 0.1 s on a GPU.
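As a rough illustration of an error expressed in percentage points, the following Python sketch compares germination percentages from manual and automated counts. It assumes the per-image error is the absolute difference between the two germination percentages, which the text above does not spell out. The function names and the counts are hypothetical.

```python
def germination_percentage(germinated, total):
    """Percentage of seeds that germinated in one image."""
    return 100.0 * germinated / total if total else 0.0


def mean_abs_error_pp(manual, predicted):
    """Mean absolute difference, in percentage points, across images.

    manual, predicted: equal-length lists of (germinated, total) counts.
    """
    errors = [
        abs(germination_percentage(*m) - germination_percentage(*p))
        for m, p in zip(manual, predicted)
    ]
    return sum(errors) / len(errors)


# Hypothetical counts for two images (germinated, total).
manual_counts = [(50, 100), (20, 80)]
predicted_counts = [(48, 100), (22, 80)]
error = mean_abs_error_pp(manual_counts, predicted_counts)
```

Reporting the difference in percentage points rather than as a relative percentage keeps the error interpretable even when few seeds germinate.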

Conclusions

DiSCount is accurate and efficient in quantifying total and germinated Striga seeds in a standardized germination assay. This automated computer vision tool enables high-throughput, large-scale screening of chemical compound libraries and biological control agents of this devastating parasitic weed. The complete software and manual are hosted at https://gitlab.com/lodewijk-track32/discount_paper and the archived version is available at Zenodo with the DOI 10.5281/zenodo.3627138. The dataset used for testing is available at Zenodo with the DOI 10.5281/zenodo.3403956.

]]>
<![CDATA[CReM: chemically reasonable mutations framework for structure generation]]> https://www.researchpad.co/product?articleinfo=N1a647cf1-00ee-41a8-9b19-2fd1ad2009ee

Structure generators are widely used in de novo design studies and their performance substantially influences the outcome. Approaches based on deep learning models and conventional atom-based approaches may result in invalid structures and fail to address synthetic feasibility. On the other hand, conventional reaction-based approaches result in synthetically feasible compounds, but the novelty and diversity of generated compounds may be limited. Fragment-based approaches can provide both better novelty and diversity of generated compounds, but the synthetic complexity of generated structures has not been explicitly addressed before. Here we developed a new framework for fragment-based structure generation that, by design, results in chemically valid structures and provides flexible control over the diversity, novelty, synthetic complexity and chemotypes of generated compounds. The framework was implemented as an open-source Python module and can be used to create custom workflows for the exploration of chemical space.

]]>
<![CDATA[AutoGrow4: an open-source genetic algorithm for de novo drug design and lead optimization]]> https://www.researchpad.co/product?articleinfo=N14cd2782-f5e9-4db3-a5ae-69f0a790ca95

We here present AutoGrow4, an open-source program for semi-automated computer-aided drug discovery. AutoGrow4 uses a genetic algorithm to evolve predicted ligands on demand and so is not limited to a virtual library of pre-enumerated compounds. It is a useful tool for generating entirely novel drug-like molecules and for optimizing preexisting ligands. By leveraging recent computational and cheminformatics advancements, AutoGrow4 is faster, more stable, and more modular than previous versions. It implements new docking-program compatibility, chemical filters, multithreading options, and selection methods to support a wide range of user needs. To illustrate both de novo design and lead optimization, we here apply AutoGrow4 to the catalytic domain of poly(ADP-ribose) polymerase 1 (PARP-1), a well characterized DNA-damage-recognition protein. AutoGrow4 produces drug-like compounds with better predicted binding affinities than FDA-approved PARP-1 inhibitors (positive controls). The predicted binding modes of the AutoGrow4 compounds mimic those of the known inhibitors, even when AutoGrow4 is seeded with random small molecules. AutoGrow4 is available under the terms of the Apache License, Version 2.0. A copy can be downloaded free of charge from http://durrantlab.com/autogrow4.

]]>
<![CDATA[MI-MAAP: marker informativeness for multi-ancestry admixed populations]]> https://www.researchpad.co/product?articleinfo=N3397d53a-1344-439a-8519-47b239db6708

Background

Admixed populations arise when two or more previously isolated populations interbreed. A powerful approach to addressing the genetic complexity in admixed populations is to infer ancestry. Ancestry inference including the proportion of an individual’s genome coming from each population and its ancestral origin along the chromosome of an admixed population requires the use of ancestry informative markers (AIMs) from reference ancestral populations. AIMs exhibit substantial differences in allele frequency between ancestral populations. Given the huge amount of human genetic variation data available from diverse populations, a computationally feasible and cost-effective approach is becoming increasingly important to extract or filter AIMs with the maximum information content for ancestry inference, admixture mapping, forensic applications, and detecting genomic regions that have been under recent selection.

Results

To address this gap, we present MI-MAAP, an easy-to-use web-based bioinformatics tool designed to prioritize informative markers for multi-ancestry admixed populations by utilizing feature selection methods and multiple genomics resources including 1000 Genomes Project and Human Genome Diversity Project. Specifically, this tool implements a novel allele frequency-based feature selection algorithm, Lancaster Estimator of Independence (LEI), as well as other genotype-based methods such as Principal Component Analysis (PCA), Support Vector Machine (SVM), and Random Forest (RF). We demonstrated that MI-MAAP is a useful tool in prioritizing informative markers and accurately classifying ancestral populations. LEI is an efficient feature selection strategy to retrieve ancestry informative variants with different allele frequency/selection pressure among (or between) ancestries without requiring computationally expensive individual-level genotype data.

Conclusions

MI-MAAP has a user-friendly interface which provides researchers an easy and fast way to filter and identify AIMs. MI-MAAP can be accessed at https://research.cchmc.org/mershalab/MI-MAAP/login/.

]]>
<![CDATA[Methods in Molecular Medicine: Microarrays in Clinical Diagnostics. Thomas O. Joos and Paolo Fortina, editors. Totowa, NJ: Humana Press, 2005, 288 pp., $121.50, hardcover. ISBN 1-58829-394-7.]]> https://www.researchpad.co/product?articleinfo=N9f1cd804-d234-4c75-a088-40dc588e1822

]]>
<![CDATA[Transcriptome Ortholog Alignment Sequence Tools (TOAST) for phylogenomic dataset assembly]]> https://www.researchpad.co/product?articleinfo=N8441a2cc-541d-4f51-ad4a-2e0a8e09c6cf

Background

Advances in next-generation sequencing technologies have reduced the cost of whole transcriptome analyses, allowing characterization of non-model species at unprecedented levels. The rapid pace of transcriptomic sequencing has driven the public accumulation of a wealth of data for phylogenomic analyses; however, a lack of tools that help phylogeneticists efficiently identify orthologous sequences currently hinders effective use of this resource.

Results

We introduce TOAST, an open source R software package that can utilize the ortholog searches based on the software Benchmarking Universal Single-Copy Orthologs (BUSCO) to assemble multiple sequence alignments of orthologous loci from transcriptomes for any group of organisms. By streamlining search, query, and alignment, TOAST automates the generation of locus and concatenated alignments, and also presents a series of outputs from which users can not only explore missing data patterns across their alignments, but also reassemble alignments based on user-defined acceptable missing data levels for a given research question.
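The idea of reassembling alignments under a user-defined missing-data threshold can be sketched in a few lines of Python. This is not TOAST's code or interface (TOAST is an R package). The presence table, the locus and taxon names, and the function name are all hypothetical.

```python
def filter_loci(presence, max_missing):
    """Keep loci whose fraction of missing taxa is at most max_missing.

    presence: dict mapping locus -> dict mapping taxon -> bool
    (True when the ortholog was recovered for that taxon).
    """
    kept = []
    for locus, taxa in presence.items():
        missing = sum(1 for found in taxa.values() if not found) / len(taxa)
        if missing <= max_missing:
            kept.append(locus)
    return kept


# Hypothetical ortholog presence table for three loci and four taxa.
presence = {
    "locus_A": {"t1": True, "t2": True, "t3": True, "t4": True},
    "locus_B": {"t1": True, "t2": False, "t3": True, "t4": True},
    "locus_C": {"t1": False, "t2": False, "t3": False, "t4": True},
}
retained = filter_loci(presence, max_missing=0.25)
```

Varying `max_missing` and re-running a downstream analysis on the retained loci is one simple way to examine how missing data affect a result.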

Conclusions

TOAST provides a comprehensive set of tools for assembly of sequence alignments of orthologs for comparative transcriptomic and phylogenomic studies. This software empowers easy assembly of public and novel sequences for any target database of candidate orthologs, and fills a critically needed niche for tools that enable quantification and testing of the impact of missing data. As open-source software, TOAST is fully customizable for integration into existing or novel custom informatic pipelines for phylogenomic inference.

Software, a detailed manual, and example data files are available on GitHub at carolinafishes.github.io.

]]>
<![CDATA[fromage: A library for the study of molecular crystal excited states at the aggregate scale]]> https://www.researchpad.co/product?articleinfo=N8def9390-e4e4-4944-9e5f-358b43f02b48

Abstract

The study of photoexcitations in molecular aggregates faces the twofold problem of the increased computational cost associated with excited states and the complexity of the interactions among the constituent monomers. A mechanistic investigation of these processes requires the analysis of the intermolecular interactions, the effect of the environment, and 3D arrangements or crystal packing on the excited states. A considerable number of techniques have been tailored to navigate these obstacles; however, they are usually restricted to in‐house codes and thus require a disproportionate effort to adopt by researchers approaching the field. Herein, we present the FRamewOrk for Molecular AGgregate Excitations (fromage), which implements a collection of such techniques in a Python library complemented with ready‐to‐use scripts. The program structure is presented and the principal features available to the user are described: geometrical analysis, exciton characterization, and a variety of ONIOM schemes. Each is illustrated by examples of diverse organic molecules in condensed phase settings. The program is available at https://github.com/Crespo-Otero-group/fromage.

]]>
<![CDATA[Millefy: visualizing cell-to-cell heterogeneity in read coverage of single-cell RNA sequencing datasets]]> https://www.researchpad.co/product?articleinfo=Nab12cd21-66ab-49ab-b211-363cc543a2c4

Background

Read coverage of RNA sequencing data reflects gene expression and RNA processing events. Single-cell RNA sequencing (scRNA-seq) methods, particularly “full-length” ones, provide read coverage of many individual cells and have the potential to reveal cellular heterogeneity in RNA transcription and processing. However, visualization tools suited to highlighting cell-to-cell heterogeneity in read coverage are still lacking.

Results

Here, we have developed Millefy, a tool for visualizing read coverage of scRNA-seq data in genomic contexts. Millefy is designed to show read coverage of all individual cells at once in genomic contexts and to highlight cell-to-cell heterogeneity in read coverage. By visualizing read coverage of all cells as a heat map and dynamically reordering cells based on diffusion maps, Millefy facilitates discovery of “local” region-specific, cell-to-cell heterogeneity in read coverage. We applied Millefy to scRNA-seq data sets of mouse embryonic stem cells and triple-negative breast cancers and showed variability of transcribed regions including antisense RNAs, 3′ UTR lengths, and enhancer RNA transcription.

Conclusions

Millefy simplifies the examination of cellular heterogeneity in RNA transcription and processing events using scRNA-seq data. Millefy is available as an R package (https://github.com/yuifu/millefy) and as a Docker image for use with Jupyter Notebook (https://hub.docker.com/r/yuifu/datascience-notebook-millefy).

]]>
<![CDATA[microbiomeDASim: Simulating longitudinal differential abundance for microbiome data]]> https://www.researchpad.co/product?articleinfo=Nba83061e-83d0-43ef-851f-d9a304b6ef17

An increasing emphasis on understanding the dynamics of microbial communities in various settings has led to the proliferation of longitudinal metagenomic sampling studies. Data from whole metagenomic shotgun sequencing and marker-gene survey studies have characteristics that drive novel statistical methodological development for estimating time intervals of differential abundance. In designing a study and the frequency of collection prior to a study, one may wish to model the ability to detect an effect, e.g., there may be issues with respect to cost, ease of access, etc. Additionally, while every study is unique, it is possible that in certain scenarios one statistical framework may be more appropriate than another. Here, we present a simulation paradigm implemented in the R Bioconductor software package microbiomeDASim, available at http://bioconductor.org/packages/microbiomeDASim. microbiomeDASim allows investigators to simulate longitudinal differentially abundant microbiome features with a variety of known functional forms and flexible parameters to control the desired signal-to-noise ratio. We present success metrics for one particular method, metaSplines.

]]>
<![CDATA[Monitoring stance towards vaccination in twitter messages]]> https://www.researchpad.co/product?articleinfo=N13c78427-7cb4-4fd2-9e29-a6f1db60794b

Background

We developed a system to automatically classify stance towards vaccination in Twitter messages, with a focus on messages with a negative stance. Such a system makes it possible to monitor the ongoing stream of messages on social media, offering actionable insights into public hesitance with respect to vaccination. At the moment, such monitoring is done by means of regular sentiment analysis with a poor performance on detecting negative stance towards vaccination. For Dutch Twitter messages that mention vaccination-related key terms, we annotated their stance and feeling in relation to vaccination (provided that they referred to this topic). Subsequently, we used these coded data to train and test different machine learning set-ups. With the aim to best identify messages with a negative stance towards vaccination, we compared set-ups at an increasing dataset size and decreasing reliability, at an increasing number of categories to distinguish, and with different classification algorithms.

Results

We found that Support Vector Machines trained on a combination of strictly and laxly labeled data with a more fine-grained labeling yielded the best result, at an F1-score of 0.36 and an Area under the ROC curve of 0.66, considerably outperforming the currently used sentiment analysis that yielded an F1-score of 0.25 and an Area under the ROC curve of 0.57. We also show that the recall of our system could be optimized to 0.60 at little loss of precision.
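For readers less familiar with the metrics reported above, the following minimal Python sketch computes precision, recall, and the F1-score for a single class from raw counts. The counts are hypothetical and are not taken from the study, and the function name is introduced here for illustration.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Precision, recall and F1-score for one class, computed from counts."""
    predicted = true_pos + false_pos  # messages the system labeled negative
    actual = true_pos + false_neg     # messages that truly are negative
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the "negative stance" class.
precision, recall, f1 = precision_recall_f1(true_pos=30, false_pos=70, false_neg=20)
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, so raising recall helps a monitoring task mainly when precision does not fall much in exchange.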

Conclusion

The outcomes of our study indicate that stance prediction by a computerized system only is a challenging task. Nonetheless, the model showed sufficient recall on identifying negative tweets so as to reduce the manual effort of reviewing messages. Our analysis of the data and behavior of our system suggests that an approach is needed in which the use of a larger training dataset is combined with a setting in which a human-in-the-loop provides the system with feedback on its predictions.

]]>
<![CDATA[ClinEpiDB: an open-access clinical epidemiology database resource encouraging online exploration of complex studies]]> https://www.researchpad.co/product?articleinfo=N3256acd6-005a-4f23-a9bb-a55c8eb6dcfa

The concept of open data has been gaining traction as a mechanism to increase data use, ensure that data are preserved over time, and accelerate discovery. While epidemiology data sets are increasingly deposited in databases and repositories, barriers to access still remain. ClinEpiDB was constructed as an open-access online resource for clinical and epidemiologic studies by leveraging the extensive web toolkit and infrastructure of the Eukaryotic Pathogen Database Resources (EuPathDB; a collection of databases covering 170+ eukaryotic pathogens, relevant related species, and select hosts) combined with a unified semantic web framework. Here we present an intuitive point-and-click website that allows users to visualize and subset data directly in the ClinEpiDB browser and immediately explore potential associations. Supporting study documentation aids contextualization, and data can be downloaded for advanced analyses. By facilitating access and interrogation of high-quality, large-scale data sets, ClinEpiDB aims to spur collaboration and discovery that improves global health.

]]>
<![CDATA[GeDi: applying suffix arrays to increase the repertoire of detectable SNVs in tumour genomes]]> https://www.researchpad.co/product?articleinfo=N3ffcb850-7114-42d6-b752-3160e809ac0e

Background

Current popular variant calling pipelines rely on the mapping coordinates of each input read to a reference genome in order to detect variants. Since reads deriving from variant loci that diverge in sequence substantially from the reference are often assigned incorrect mapping coordinates, variant calling pipelines that rely on mapping coordinates can exhibit reduced sensitivity.

Results

In this work we present GeDi, a suffix array-based somatic single nucleotide variant (SNV) calling algorithm that does not rely on read mapping coordinates to detect SNVs and is therefore capable of reference-free and mapping-free SNV detection. GeDi executes with practical runtime and memory resource requirements, is capable of SNV detection at very low allele frequency (<1%), and detects SNVs with high sensitivity at complex variant loci, dramatically outperforming MuTect, a well-established pipeline.
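To illustrate the underlying data structure, here is a minimal Python sketch of a suffix array with binary-search substring lookup. It is not GeDi's algorithm, whose details are not given above; it only shows how sequence occurrences can be located without mapping coordinates. The naive construction is for illustration only, and both function names are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction that sorts whole suffixes; production tools use
    far more efficient algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return sorted start positions of pattern, using two binary searches."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


# Hypothetical read sequence and k-mer.
sequence = "ACGTACGGACGT"
sa = build_suffix_array(sequence)
hits = find_occurrences(sequence, sa, "ACG")
```

Counting how often a k-mer occurs in tumour reads versus matched normal reads is one way such lookups could support reference-free comparison, although that is an illustrative assumption rather than a description of GeDi.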

Conclusion

By designing novel suffix-array based SNV calling methods, we have developed a practical SNV calling software, GeDi, that can characterise SNVs at complex variant loci and at low allele frequency thus increasing the repertoire of detectable SNVs in tumour genomes. We expect GeDi to find use cases in targeted-deep sequencing analysis, and to serve as a replacement and improvement over previous suffix-array based SNV calling methods.

]]>
<![CDATA[ECFS-DEA: an ensemble classifier-based feature selection for differential expression analysis on expression profiles]]> https://www.researchpad.co/product?articleinfo=N8eb4af06-8f9c-45d7-9ed8-cd39c132d89c

Background

Various methods for differential expression analysis have been widely used to identify features which best distinguish between different categories of samples. Multiple hypothesis testing may leave out explanatory features, each of which may be composed of individually insignificant variables. Multivariate hypothesis testing remains outside the mainstream because of the large computational overhead of large-scale matrix operations. Random forest provides a classification strategy for calculation of variable importance. However, it may be unsuitable for different distributions of samples.

Results

Based on the idea of using an ensemble classifier, we develop a feature selection tool for differential expression analysis on expression profiles (ECFS-DEA for short). Considering the differences in sample distribution, a graphical user interface is designed to allow the selection of different base classifiers. Inspired by random forest, a common measure which is applicable to any base classifier is proposed for calculation of variable importance. After an interactive selection of a feature on sorted individual variables, a projection heatmap is presented using k-means clustering. A ROC curve is also provided; both can intuitively demonstrate the effectiveness of the selected feature.

Conclusions

Feature selection through ensemble classifiers helps to select important variables and thus is applicable for different sample distributions. Experiments on simulation and realistic data demonstrate the effectiveness of ECFS-DEA for differential expression analysis on expression profiles. The software is available at http://bio-nefu.com/resource/ecfs-dea.

]]>