ResearchPad - Computational Mathematics https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks <![CDATA[YASARA View—molecular graphics for all devices—from smartphones to workstations]]> https://www.researchpad.co/product?articleinfo=5ba773f340307c2b1f5d481a

Summary: Today's graphics processing units (GPUs) compose the scene from individual triangles. As about 320 triangles are needed to approximate a single sphere—an atom—in a convincing way, visualizing larger proteins with atomic details requires tens of millions of triangles, far too many for smooth interactive frame rates. We describe a new approach to solve this ‘molecular graphics problem’, which shares the work between GPU and multiple CPU cores, generates high-quality results with perfectly round spheres, shadows and ambient lighting and requires only OpenGL 1.0 functionality, without any pixel shader Z-buffer access (a feature which is missing in most mobile devices).

Availability and implementation: YASARA View, a molecular modeling program built around the visualization algorithm described here, is freely available (including commercial use) for Linux, MacOS, Windows and Android (Intel) from www.YASARA.org.

Contact: elmar@yasara.org

Supplementary information: Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[ccSOL omics: a webserver for solubility prediction of endogenous and heterologous expression in Escherichia coli]]> https://www.researchpad.co/product?articleinfo=5ba773f140307c2b1f5d4819

Summary: Here we introduce ccSOL omics, a webserver for large-scale calculations of protein solubility. Our method allows (i) proteome-wide predictions; (ii) identification of soluble fragments within each sequence; and (iii) exhaustive single-point mutation analysis.

Results: Using coil/disorder, hydrophobicity, hydrophilicity, β-sheet and α-helix propensities, we built a predictor of protein solubility. Our approach shows an accuracy of 79% on the training set (36 990 Target Track entries). Validation on three independent sets indicates that ccSOL omics discriminates soluble and insoluble proteins with an accuracy of 74% on 31 760 proteins sharing <30% sequence similarity.

Availability and implementation: ccSOL omics can be freely accessed on the web at http://s.tartaglialab.com/page/ccsol_group. Documentation and tutorial are available at http://s.tartaglialab.com/static_files/shared/tutorial_ccsol_omics.html.

Contact: gian.tartaglia@crg.es

Supplementary information: Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[GATB: Genome Assembly & Analysis Tool Box]]> https://www.researchpad.co/product?articleinfo=5ba773ea40307c2b1f5d4816

Motivation: Efficient and fast next-generation sequencing (NGS) algorithms are essential to analyze the terabytes of data generated by the NGS machines. A serious bottleneck can be the design of such algorithms, as they require sophisticated data structures and advanced hardware implementation.

Results: We propose an open-source library dedicated to genome assembly and analysis to speed up the development of efficient software. The library is based on a recent, optimized de Bruijn graph implementation that allows complex genomes to be processed on desktop computers using fast algorithms with low memory footprints.
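As a rough illustration of the data structure named above, the following Python sketch builds a toy de Bruijn graph from reads. It is a plain dictionary-based example and not GATB's memory-efficient implementation or its C++ API; the function name `build_de_bruijn` and the sample reads are invented for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the sorted (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k over the read; each k-mer is one edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


# Toy reads, invented for illustration only.
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Overlapping reads contribute the same edges, so the graph stays compact even when coverage is high; that property is what makes the structure attractive for assembly.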

Availability and implementation: The GATB library is written in C++ and is available at the following Web site http://gatb.inria.fr under the A-GPL license.

Contact: lavenier@irisa.fr

Supplementary information: Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[COSMOS: Python library for massively parallel workflows]]> https://www.researchpad.co/product?articleinfo=5ba773e840307c2b1f5d4815

Summary: Efficient workflows to shepherd clinically generated genomic data through the multiple stages of a next-generation sequencing pipeline are of critical importance in translational biomedical science. Here we present COSMOS, a Python library for workflow management that allows formal description of pipelines and partitioning of jobs. In addition, it includes a user interface for tracking the progress of jobs, abstraction of the queuing system and fine-grained control over the workflow. Workflows can be created on traditional computing clusters as well as cloud-based services.

Availability and implementation: Source code is available for academic non-commercial research purposes. Links to code and documentation are provided at http://lpm.hms.harvard.edu and http://wall-lab.stanford.edu.

Contact: dpwall@stanford.edu or peter_tonellato@hms.harvard.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[bammds: a tool for assessing the ancestry of low-depth whole-genome data using multidimensional scaling (MDS)]]> https://www.researchpad.co/product?articleinfo=5ba773ec40307c2b1f5d4817

Summary: We present bammds, a practical tool that allows visualization of samples sequenced by second-generation sequencing when compared with a reference panel of individuals (usually genotypes) using a multidimensional scaling algorithm. Our tool is aimed at determining the ancestry of unknown samples—typical of ancient DNA data—particularly when only low amounts of data are available for those samples.

Availability and implementation: The software package is freely available under the GNU General Public License v3, together with test datasets, at https://savannah.nongnu.org/projects/bammds/. It uses R (http://www.r-project.org/), parallel (http://www.gnu.org/software/parallel/) and samtools (https://github.com/samtools/samtools).

Contact: bammds-users@nongnu.org

Supplementary information: Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[RAPIDR: an analysis package for non-invasive prenatal testing of aneuploidy]]> https://www.researchpad.co/product?articleinfo=5ba773ef40307c2b1f5d4818

Non-invasive prenatal testing (NIPT) of fetal aneuploidy using cell-free fetal DNA is becoming part of routine clinical practice. RAPIDR (Reliable Accurate Prenatal non-Invasive Diagnosis R package) is an easy-to-use open-source R package that implements several published NIPT analysis methods. The input to RAPIDR is a set of sequence alignment files in the BAM format, and the outputs are calls for aneuploidy, including trisomies 13, 18, 21 and monosomy X as well as fetal sex. RAPIDR has been extensively tested with a large sample set as part of the RAPID project in the UK. The package contains quality control steps to make it robust for use in the clinical setting.

Availability and implementation: RAPIDR is implemented in R and can be freely downloaded via CRAN from here: http://cran.r-project.org/web/packages/RAPIDR/index.html.

Contact: kitty.lo@ucl.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[Identification of MicroRNA Precursors with Support Vector Machine and String Kernel]]> https://www.researchpad.co/product?articleinfo=5b046486463d7e20f6793dc1

MicroRNAs (miRNAs) are one family of short (21–23 nt) regulatory non-coding RNAs processed from long (70–110 nt) miRNA precursors (pre-miRNAs). Identifying true and false precursors plays an important role in computational identification of miRNAs. Some numerical features have been extracted from precursor sequences and their secondary structures to suit some classification methods; however, they may lose some usefully discriminative information hidden in sequences and structures. In this study, pre-miRNA sequences and their secondary structures are directly used to construct an exponential kernel based on weighted Levenshtein distance between two sequences. This string kernel is then combined with support vector machine (SVM) for detecting true and false pre-miRNAs. Based on 331 training samples of true and false human pre-miRNAs, 2 key parameters in SVM are selected by 5-fold cross validation and grid search, and 5 realizations with different 5-fold partitions are executed. Among 16 independent test sets from 3 human, 8 animal, 2 plant, 1 virus, and 2 artificially false human pre-miRNAs, our method statistically outperforms the previous SVM-based technique on 11 sets, including 3 human, 7 animal, and 1 false human pre-miRNAs. In particular, pre-miRNAs with multiple loops that were usually excluded in the previous work are correctly identified in this study with an accuracy of 92.66%.
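A minimal Python sketch of an exponential kernel over a weighted Levenshtein distance may make the construction above more concrete. The edit weights, the `gamma` parameter and the function names are assumptions chosen for illustration; the abstract does not give the authors' actual weights, nor how sequence and secondary structure are combined.

```python
import math


def weighted_levenshtein(a, b, w_ins=1.0, w_del=1.0, w_sub=1.0):
    """Edit distance between a and b with separate costs per operation."""
    # prev[j] holds the distance between the first i-1 symbols of a
    # and the first j symbols of b.
    prev = [j * w_ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * w_del]
        for j, cb in enumerate(b, start=1):
            sub_cost = 0.0 if ca == cb else w_sub
            curr.append(min(prev[j] + w_del,          # delete ca
                            curr[j - 1] + w_ins,      # insert cb
                            prev[j - 1] + sub_cost))  # match or substitute
        prev = curr
    return prev[-1]


def exponential_kernel(a, b, gamma=0.1, **weights):
    """k(a, b) = exp(-gamma * d(a, b)); gamma is a hypothetical width parameter."""
    return math.exp(-gamma * weighted_levenshtein(a, b, **weights))


print(weighted_levenshtein("ACGU", "AGU"))
print(exponential_kernel("ACGU", "ACGU"))
```

Identical strings have distance zero and therefore kernel value one, and the value decays as more edits are needed. A precomputed matrix of such values could then be handed to an SVM that accepts custom kernels.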

]]>
<![CDATA[The spatiotemporal order of signaling events unveils the logic of development signaling]]> https://www.researchpad.co/product?articleinfo=5b00dd3a463d7e3c2d2a5071

Motivation: Animals from worms and insects to birds and mammals show distinct body plans; however, the embryonic development of diverse body plans, with the tissues and organs within them, is controlled by surprisingly few signaling pathways. It is well recognized that the combinatorial use of, and dynamic interactions among, signaling pathways follow a specific logic to control complex and accurate developmental signaling and patterning, but it remains elusive what such logic is or even what it looks like.

Results: We have developed a computational model for Drosophila eye development with innovative methods to reveal how interactions among multiple pathways control the dynamically generated hexagonal array of R8 cells. We obtained two novel findings. First, the coupling between the long-range inductive signals produced by the proneural Hh signaling and the short-range restrictive signals produced by the antineural Notch and EGFR signaling is essential for generating accurately spaced R8s. Second, the spatiotemporal orders of key signaling events reveal a robust pattern of lateral inhibition conducted by Ato-coordinated Notch and EGFR signaling to collectively determine R8 patterning. This pattern, stipulating the orders of signaling and comparable to the protocols of communication, may help decipher the well-appreciated but poorly defined logic of developmental signaling.

Availability and implementation: The model is available upon request.

Contact: hao.zhu@ymail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[An Integrative Meta-analysis of MicroRNAs in Hepatocellular Carcinoma]]> https://www.researchpad.co/product?articleinfo=5ae6aba9463d7e61d62f9c4b

We aimed to shed new light on the roles of microRNAs (miRNAs) in liver cancer using an integrative in silico bioinformatics analysis. A new protocol for target prediction and functional analysis is presented and applied to the 26 highly differentially deregulated miRNAs in hepatocellular carcinoma. This framework comprises: (1) the overlap of prediction results by four out of five target prediction tools, including TargetScan, PicTar, miRanda, DIANA-microT and miRDB (combining machine-learning, alignment, interaction energy and statistical tests in order to minimize false positives), (2) evidence from previous microarray analysis on the expression of these targets, (3) gene ontology (GO) and pathway enrichment analysis of the miRNA targets and their pathways and (4) linking these results to oncogenesis and cancer hallmarks. This yielded new insights into the roles of miRNAs in cancer hallmarks. Here we presented several key targets and hundreds of new targets that are significantly enriched in many new cancer-related hallmarks. In addition, we also revealed some known and new oncogenic pathways for liver cancer. These included the famous MAPK, TGFβ and cell cycle pathways. New insights were also provided into Wnt signaling, prostate cancer, axon guidance and oocyte meiosis pathways. These signaling and developmental pathways crosstalk to regulate stem cell transformation and implicate a role of miRNAs in hepatic stem cell deregulation and cancer development. By analyzing their complete interactome, we proposed new categorization for some of these miRNAs as either tumor-suppressors or oncomiRs with dual roles. Therefore some of these miRNAs may be addressed as therapeutic targets or used as therapeutic agents. Such dual roles thus expand the view of miRNAs as active maintainers of cellular homeostasis.
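The first step of the framework above, keeping only targets predicted by at least four of the five tools, can be sketched in Python as a simple vote count. The gene names and per-tool prediction sets below are placeholders invented for illustration, and `consensus_targets` is a hypothetical helper, not part of any of the named tools.

```python
from collections import Counter


def consensus_targets(predictions, min_tools=4):
    """Return targets predicted by at least min_tools of the tools."""
    votes = Counter()
    for targets in predictions.values():
        votes.update(set(targets))  # each tool votes at most once per target
    return sorted(gene for gene, n in votes.items() if n >= min_tools)


# Placeholder predictions for a single miRNA; not real data.
predictions = {
    "TargetScan": {"GENE_A", "GENE_B", "GENE_C"},
    "PicTar": {"GENE_A", "GENE_B"},
    "miRanda": {"GENE_A", "GENE_B", "GENE_D"},
    "DIANA-microT": {"GENE_A", "GENE_C"},
    "miRDB": {"GENE_A", "GENE_B", "GENE_D"},
}
print(consensus_targets(predictions))
```

Raising `min_tools` trades sensitivity for fewer false positives, which is the stated motivation for requiring agreement among tools built on different principles.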

]]>
<![CDATA[A statistical approach for inferring the 3D structure of the genome]]> https://www.researchpad.co/product?articleinfo=5ada4107463d7e055efa2c26

Motivation: Recent technological advances allow the measurement, in a single Hi-C experiment, of the frequencies of physical contacts among pairs of genomic loci at a genome-wide scale. The next challenge is to infer, from the resulting DNA–DNA contact maps, accurate 3D models of how chromosomes fold and fit into the nucleus. Many existing inference methods rely on multidimensional scaling (MDS), in which the pairwise distances of the inferred model are optimized to resemble pairwise distances derived directly from the contact counts. These approaches, however, often optimize a heuristic objective function and require strong assumptions about the biophysics of DNA to transform interaction frequencies to spatial distance, and thereby may lead to incorrect structure reconstruction.

Methods: We propose a novel approach to infer a consensus 3D structure of a genome from Hi-C data. The method incorporates a statistical model of the contact counts, assuming that the counts between two loci follow a Poisson distribution whose intensity decreases with the physical distances between the loci. The method can automatically adjust the transfer function relating the spatial distance to the Poisson intensity and infer a genome structure that best explains the observed data.
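To make the statistical model above concrete, here is a hedged Python sketch of a Poisson log-likelihood for contact counts given 3D coordinates. The power-law transfer function (intensity = `beta * distance ** alpha` with negative `alpha`), the parameter values and the toy data are assumptions for illustration; the abstract only states that the intensity decreases with distance.

```python
import math
from itertools import combinations


def poisson_log_likelihood(coords, counts, alpha=-3.0, beta=1.0):
    """Log-likelihood of contact counts under c_ij ~ Poisson(beta * d_ij ** alpha).

    coords: list of (x, y, z) positions, one per locus.
    counts: dict mapping (i, j) with i < j to the observed contact count.
    alpha, beta: assumed transfer-function parameters (alpha < 0).
    """
    total = 0.0
    for i, j in combinations(range(len(coords)), 2):
        d = math.dist(coords[i], coords[j])
        lam = beta * d ** alpha  # expected count shrinks as distance grows
        c = counts[(i, j)]
        total += c * math.log(lam) - lam - math.lgamma(c + 1)
    return total


# Toy example: loci 0-1 and 1-2 interact often, loci 0-2 rarely.
counts = {(0, 1): 5, (1, 2): 5, (0, 2): 1}
chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
scrambled = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(poisson_log_likelihood(chain, counts) > poisson_log_likelihood(scrambled, counts))
```

Inference then amounts to searching for the coordinates (and, in the variant with an optimized transfer function, also the parameters) that maximize this quantity.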

Results: We compare two variants of our Poisson method, with or without optimization of the transfer function, to four different MDS-based algorithms—two metric MDS methods using different stress functions, a non-metric version of MDS and ChromSDE, a recently described, advanced MDS method—on a wide range of simulated datasets. We demonstrate that the Poisson models reconstruct better structures than all MDS-based methods, particularly at low coverage and high resolution, and we highlight the importance of optimizing the transfer function. On publicly available Hi-C data from mouse embryonic stem cells, we show that the Poisson methods lead to more reproducible structures than MDS-based methods when we use data generated using different restriction enzymes, and when we reconstruct structures at different resolutions.

Availability and implementation: A Python implementation of the proposed method is available at http://cbio.ensmp.fr/pastis.

Contact: william-noble@uw.edu or jean-philippe.vert@mines.org

]]>
<![CDATA[Functional association networks as priors for gene regulatory network inference]]> https://www.researchpad.co/product?articleinfo=5ad5e039463d7e4bf0533951

Motivation: Gene regulatory network (GRN) inference reveals the influences genes have on one another in cellular regulatory systems. If the experimental data are inadequate for reliable inference of the network, informative priors have been shown to improve the accuracy of inferences.

Results: This study explores the potential of undirected, confidence-weighted networks, such as those in functional association databases, as a prior source for GRN inference. Such networks often erroneously indicate symmetric interaction between genes and may contain mostly correlation-based interaction information. Despite these drawbacks, our testing on synthetic datasets indicates that even noisy priors reflect some causal information that can improve GRN inference accuracy. Our analysis on yeast data indicates that using the functional association databases FunCoup and STRING as priors can give a small improvement in GRN inference accuracy with biological data.

Contact: matthew.studham@scilifelab.se

Supplementary information: Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[DNorm: disease name normalization with pairwise learning to rank]]> https://www.researchpad.co/product?articleinfo=5ace579d463d7e10454d18d3

Motivation: Despite the central role of diseases in biomedical research, there have been far fewer attempts to automatically determine which diseases are mentioned in a text (the task of disease name normalization, DNorm) than for other normalization tasks in biomedical text mining research.

Methods: In this article we introduce the first machine learning approach for DNorm, using the NCBI disease corpus and the MEDIC vocabulary, which combines MeSH® and OMIM. Our method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data. The technique is based on pairwise learning to rank, which has not previously been applied to the normalization task but has proven successful in large optimization problems for information retrieval.

Results: We compare our method with several techniques based on lexical normalization and matching, MetaMap and Lucene. Our algorithm achieves 0.782 micro-averaged F-measure and 0.809 macro-averaged F-measure, an increase over the highest performing baseline method of 0.121 and 0.098, respectively.
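The difference between the two reported scores can be shown with a short Python sketch. Micro-averaging pools the true positives, false positives and false negatives before computing one F-measure, whereas macro-averaging computes an F-measure per unit and then averages. The counts below are invented, and the unit of averaging (for example, per abstract) is an assumption, since it is not stated here.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure from raw counts; 0.0 when nothing is predicted or expected."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f(per_unit_counts):
    """per_unit_counts: list of (tp, fp, fn) tuples, one per unit."""
    tp = sum(c[0] for c in per_unit_counts)
    fp = sum(c[1] for c in per_unit_counts)
    fn = sum(c[2] for c in per_unit_counts)
    micro = f_measure(tp, fp, fn)
    macro = sum(f_measure(*c) for c in per_unit_counts) / len(per_unit_counts)
    return micro, macro


# Invented counts for two units: one large and easy, one small and hard.
counts = [(8, 2, 2), (1, 0, 3)]
micro, macro = micro_macro_f(counts)
print(round(micro, 3), round(macro, 3))
```

Because the micro average is dominated by units with many mentions, the two scores can differ noticeably, which is why both are commonly reported.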

Availability: The source code for DNorm is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/DNorm, along with a web-based demonstration and links to the NCBI disease corpus. Results on PubMed abstracts are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator

Contact: zhiyong.lu@nih.gov

]]>
<![CDATA[SPANNER: taxonomic assignment of sequences using pyramid matching of similarity profiles]]> https://www.researchpad.co/product?articleinfo=5accaf94463d7e47e500b4ef

Background: Homology-based taxonomic assignment is impeded by differences between the unassigned read and reference database, forcing a rank-specific classification to the closest (and possibly incorrect) reference lineage. This assignment may be correct only to a general rank (e.g. order) and incorrect below that rank (e.g. family and genus). Algorithms like LCA avoid this by varying the predicted taxonomic rank based on matches to a set of taxonomic references. LCA and related approaches can be conservative, especially if best matches are taxonomically widespread because of events such as lateral gene transfer (LGT).
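The basic LCA idea described above can be sketched in Python as the longest shared prefix of the lineages of the best matches. This is only the plain lowest-common-ancestor step, not SPANNER's profile comparison, and the function name and example lineages are chosen for illustration.

```python
def lowest_common_ancestor(lineages):
    """Return the ranks shared by all lineages, from the root downward."""
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # lineages diverge at this rank
        shared.append(ranks[0])
    return shared


# Lineages of three hypothetical best matches (domain down to genus).
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Enterobacteriaceae", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Enterobacteriaceae", "Salmonella"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Pseudomonadales", "Pseudomonadaceae", "Pseudomonas"],
]
print(lowest_common_ancestor(hits))
```

A single distant match, such as one introduced by lateral gene transfer, is enough to pull the assignment up to a much more general rank, which is the conservatism noted above.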

Results: Our extension to LCA called SPANNER (similarity profile annotater) uses the set of best homology matches (the LCA Profile) for a given sequence and compares this profile with a set of profiles inferred from taxonomic reference organisms. SPANNER provides an assignment that is less sensitive to LGT and other confounding phenomena. In a series of trials on real and artificial datasets, SPANNER outperformed LCA-style algorithms in terms of taxonomic precision and outperformed best BLAST at certain levels of taxonomic novelty in the dataset. We identify examples where LCA made an overly conservative prediction, but SPANNER produced a more precise and correct prediction.

Conclusions: By using profiles of homology matches to represent patterns of genomic similarity that arise because of vertical and lateral inheritance, SPANNER offers an effective compromise between taxonomic assignment based on best BLAST scores, and the conservative approach of LCA and similar approaches.

Availability: C++ source code and binaries are freely available at http://kiwi.cs.dal.ca/Software/SPANNER.

Contact: beiko@cs.dal.ca

Supplementary information: Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[Minimum curvilinearity to enhance topological prediction of protein interactions by network embedding]]> https://www.researchpad.co/product?articleinfo=5acc56a6463d7e4085c546f7

Motivation: Most functions within the cell emerge thanks to protein–protein interactions (PPIs), yet experimental determination of PPIs is both expensive and time-consuming. PPI networks present significant levels of noise and incompleteness. Predicting interactions using only PPI-network topology (topological prediction) is difficult but essential when prior biological knowledge is absent or unreliable.

Methods: Network embedding emphasizes the relations between network proteins embedded in a low-dimensional space, in which protein pairs that are closer to each other represent good candidate interactions. To achieve network denoising, which boosts prediction performance, we first applied minimum curvilinear embedding (MCE), and then adopted shortest path (SP) in the reduced space to assign likelihood scores to candidate interactions. Furthermore, we introduce (i) a new valid variation of MCE, named non-centred MCE (ncMCE); (ii) two automatic strategies for selecting the appropriate embedding dimension; and (iii) two new randomized procedures for evaluating predictions.

Results: We compared our method against several unsupervised and supervisedly tuned embedding approaches and node neighbourhood techniques. Despite its computational simplicity, ncMCE-SP was the overall leader, outperforming the current methods in topological link prediction.

Conclusion: Minimum curvilinearity is a valuable non-linear framework that we successfully applied to the embedding of protein networks for the unsupervised prediction of novel PPIs. The rationale for our approach is that biological and evolutionary information is imprinted in the non-linear patterns hidden behind the protein network topology, and can be exploited for predicting new protein links. The predicted PPIs represent good candidates for testing in high-throughput experiments or for exploitation in systems biology tools such as those used for network-based inference and prediction of disease-related functional modules.

Availability: https://sites.google.com/site/carlovittoriocannistraci/home

Contact: kalokagathos.agon@gmail.com or timothy.ravasi@kaust.edu.sa

Supplementary information: Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[Computational Small RNA Prediction in Bacteria]]> https://www.researchpad.co/product?articleinfo=5aca47ea463d7e7c850dce55

Bacterial small RNAs were once regarded as potent regulators of gene expression and are now being considered as essential for their diversified roles. Many small RNAs are now reported to have a wide array of regulatory functions, ranging from environmental sensing to pathogenesis. Traditionally, noncoding transcripts were rarely detected by means of genetic screens. However, the availability of approximately 2200 prokaryotic genome sequences in public databases facilitates the efficient computational search for those molecules, followed by experimental validation. In principle, the following four major computational methods were applied for the prediction of sRNA locations from bacterial genome sequences: (1) comparative genomics, (2) secondary structure and thermodynamic stability, (3) ‘Orphan’ transcriptional signals and (4) ab initio methods regardless of sequence or structure similarity; most of these tools were applied to identify putative genomic sRNA locations, followed by experimental validation of those transcripts. Therefore, computational screening has simplified the sRNA identification process in bacteria. In this review, a plethora of small RNA prediction methods and tools that have been reported in the past decade are discussed comprehensively and assessed based on their attributes, compatibility and prediction accuracy.

]]>
<![CDATA[Multi-view methods for protein structure comparison using latent dirichlet allocation]]> https://www.researchpad.co/product?articleinfo=5abbc972463d7e28eb5157a8

Motivation: With rapidly expanding protein structure databases, efficiently retrieving structures similar to a given protein is an important problem. It involves two major issues: (i) an effective protein structure representation that captures the inherent relationships between fragments and facilitates efficient comparison between structures and (ii) an effective framework to address different retrieval requirements. Recently, researchers proposed a vector space model of proteins using a bag-of-fragments representation (FragBag), which corresponds to the basic information retrieval model.

Results: In this article, we propose an improved representation of protein structures using the latent Dirichlet allocation topic model. Another important requirement is to retrieve proteins whether they are close or remote homologs. To meet diverse objectives, we propose a multi-viewpoint based framework that combines multiple representations and retrieval techniques. We evaluate the proposed representation and retrieval framework on the benchmark dataset developed by Kolodny and co-workers. The results indicate that the proposed techniques outperform state-of-the-art methods.

Availability: http://www.cse.iitm.ac.in/~ashishvt/research/protein-lda/.

Contact: ashishvt@cse.iitm.ac.in

]]>
<![CDATA[MicroRNA-mediated regulation of target genes in several brain regions is correlated to both microRNA-targeting-specific promoter methylation and differential microRNA expression]]> https://www.researchpad.co/product?articleinfo=5989da0dab0ee8fa60b7848c

Background

Public domain databases now provide multiple layers of genome-wide data (e.g., promoter methylation, mRNA expression and miRNA expression) and should enable integrative modeling of the mechanisms that regulate gene expression. However, research along these lines has rarely been carried out.

Results

Here, public domain datasets of mRNA expression, microRNA (miRNA) expression and promoter methylation patterns in four regions of the human brain (the frontal cortex, temporal cortex, pons and cerebellum) were sourced from the National Center for Biotechnology Information's Gene Expression Omnibus and reanalyzed computationally. Many instances of miRNA-mediated regulation of target genes and of miRNA-targeting-specific promoter methylation were identified in the six pairwise comparisons among the four brain regions. The miRNA-mediated regulation of target genes was found to be highly correlated with one or both of miRNA-targeting-specific promoter methylation and differential miRNA expression. Genes enriched for Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways that were related to brain function and/or development were found among the target genes of miRNAs whose differential expression patterns were highly correlated with the miRNA-mediated regulation of their target genes.

Conclusions

The combinatorial analysis of miRNA-mediated regulation of target genes, miRNA-targeting-specific promoter methylation and differential miRNA expression can help reveal the brain region-specific contributions of miRNAs to brain function and development.

]]>
<![CDATA[Big Data analysis on autopilot?]]> https://www.researchpad.co/product?articleinfo=5989db3aab0ee8fa60bd4593

]]>
<![CDATA[Using random walks to identify cancer-associated modules in expression data]]> https://www.researchpad.co/product?articleinfo=5989daf3ab0ee8fa60bc223e

Background

The etiology of cancer involves a complex series of genetic and environmental conditions. To better represent and study the intricate genetics of cancer onset and progression, we construct a network of biological interactions to search for groups of genes that compose cancer-related modules. Three cancer expression datasets are investigated to prioritize genes and interactions associated with cancer outcomes. Using a graph-based approach to search for communities of phenotype-related genes in microarray data, we find modules of genes associated with cancer phenotypes in a weighted interaction network.

Results

We implement Walktrap, a random-walk-based community detection algorithm, to identify biological modules predisposing to tumor growth in 22 hepatocellular carcinoma samples (GSE14520), adenoma development in 32 colorectal cancer samples (GSE8671), and prognosis in 198 breast cancer patients (GSE7390). For each study, we find the best scoring partitions under a maximum cluster size of 200 nodes. Significant modules highlight groups of genes that are functionally related to cancer and show promise as therapeutic targets; these include interactions among transcription factors (SPIB, RPS6KA2 and RPS6KA6), cell-cycle regulatory genes (BRSK1, WEE1 and CDC25C), modulators of the cell-cycle and proliferation (CBLC and IRS2) and genes that regulate and participate in the map-kinase pathway (MAPK9, DUSP1, DUSP9, RIPK2). To assess the performance of Walktrap to find genomic modules (Walktrap-GM), we evaluate our results against other tools recently developed to discover disease modules in biological networks. Compared with other highly cited module-finding tools, jActiveModules and Matisse, Walktrap-GM shows strong performance in the discovery of modules enriched with known cancer genes.

Conclusions

These results demonstrate that the Walktrap-GM algorithm identifies modules significantly enriched with cancer genes, their joint effects and promising candidate genes. The approach performs well when evaluated against similar tools and smaller overall module size allows for more specific functional annotation and facilitates the interpretation of these modules.

]]>
<![CDATA[Semi-supervised consensus clustering for gene expression data analysis]]> https://www.researchpad.co/product?articleinfo=5989daa6ab0ee8fa60ba77ae

Background

Simple clustering methods such as hierarchical clustering and k-means are widely used for gene expression data analysis, but they are unable to deal with the noise and high dimensionality associated with microarray gene expression data. Consensus clustering appears to improve the robustness and quality of clustering results. Incorporating prior knowledge in the clustering process (semi-supervised clustering) has been shown to improve the consistency between the data partitioning and domain knowledge.

Methods

We proposed semi-supervised consensus clustering (SSCC) to integrate the consensus clustering with semi-supervised clustering for analyzing gene expression data. We investigated the roles of consensus clustering and prior knowledge in improving the quality of clustering. SSCC was compared with one semi-supervised clustering algorithm, one consensus clustering algorithm, and k-means. Experiments on eight gene expression datasets were performed using h-fold cross-validation.
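The consensus step can be illustrated with a generic co-association sketch in Python: several base clusterings are combined by counting how often each pair of items lands in the same cluster, and pairs that co-cluster in a majority of runs are merged. This is a common textbook form of consensus clustering and is only an assumption about the general idea; it is not the SSCC algorithm itself, and the function names, threshold and labelings are invented.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of base clusterings in which each pair of items shares a cluster."""
    n = len(labelings[0])
    return {
        (i, j): sum(lab[i] == lab[j] for lab in labelings) / len(labelings)
        for i, j in combinations(range(n), 2)
    }


def consensus_groups(labelings, threshold=0.5):
    """Merge items whose co-association exceeds threshold (connected components)."""
    n = len(labelings[0])
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), frac in co_association(labelings).items():
        if frac > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Three invented base clusterings of five samples; label values are arbitrary.
labelings = [[0, 0, 1, 1, 1], [1, 1, 0, 0, 1], [0, 0, 0, 1, 1]]
print(consensus_groups(labelings))
```

Prior knowledge, for example pairs of samples known to belong together, could be folded in by adjusting the co-association values before merging, though how SSCC does this is not described here.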

Results

Using prior knowledge improved the clustering quality by reducing the impact of noise and high dimensionality in microarray data. Integration of consensus clustering with semi-supervised clustering improved performance as compared to using consensus clustering or semi-supervised clustering separately. Our SSCC method outperformed the others tested in this paper.

]]>