ResearchPad - cluster-analysis https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks <![CDATA[Design of composite measure schemes for comparative severity assessment in animal-based neuroscience research: A case study focussed on rat epilepsy models]]> https://www.researchpad.co/article/elastic_article_14687 Comparative severity assessment of animal models and experimental interventions is of utmost relevance for harm-benefit analysis during ethical evaluation, for animal welfare-based model prioritization, and for the validation of refinement measures. Unfortunately, there is a lack of evidence-based approaches to grade an animal’s burden in a sensitive, robust, precise, and objective manner. Particular challenges need to be considered in the context of animal-based neuroscientific research because models of neurological disorders can be characterized by relevant changes in the affective state of an animal. Here, we report an approach for parameter selection and development of a composite measure scheme designed for precise analysis of the distress of animals in a specific model category. Data sets from the analysis of several behavioral and biochemical parameters in three different epilepsy models were subjected to a principal component analysis to select the most informative parameters. The top-ranking parameters included burrowing, open field locomotion, social interaction, and saccharin preference. These were combined to create a composite measure scheme (CMS). CMS data were subjected to cluster analysis enabling the allocation of severity levels to individual animals. The results provided information for a direct comparison between models indicating a comparable severity of the electrical and chemical post-status epilepticus models, and a lower severity of the kindling model. 
The new CMS can be directly applied for comparison of other rat models with seizure activity or for assessment of novel refinement approaches in the respective research field. The respective online tool for direct application of the CMS or for creating a new CMS based on other parameters from different models is available at https://github.com/mytalbot/cms. However, the robustness and generalizability need to be further assessed in future studies. More importantly, our concept of parameter selection can serve as a practice example providing the basis for comparable approaches applicable to the development and validation of CMS for all kinds of disease models or interventions.

]]>
<![CDATA[COMBSecretomics: A pragmatic methodological framework for higher-order drug combination analysis using secretomics]]> https://www.researchpad.co/article/elastic_article_14596 Multidrug treatments are increasingly used in the clinic to combat complex and co-occurring diseases. However, most drug combination discovery efforts today focus on anticancer therapy and rarely examine the potential of using more than two drugs simultaneously. Moreover, there is currently no reported methodology for performing second- and higher-order drug combination analysis of secretomic patterns, meaning protein concentration profiles released by the cells. Here, we introduce COMBSecretomics (https://github.com/EffieChantzi/COMBSecretomics.git), the first pragmatic methodological framework designed to search exhaustively for second- and higher-order mixtures of candidate treatments that can modify, or even reverse, malfunctioning secretomic patterns of human cells. This framework comes with two novel model-free combination analysis methods: a tailor-made generalization of the highest single agent principle and a data mining approach based on top-down hierarchical clustering. Quality control procedures to eliminate outliers and non-parametric statistics to quantify uncertainty in the results obtained are also included. COMBSecretomics is based on a standardized reproducible format and could be employed with any experimental platform that provides the required protein release data. Its practical use and functionality are demonstrated by means of a proof-of-principle pharmacological study related to cartilage degradation. COMBSecretomics is the first methodological framework reported to enable secretome-related second- and higher-order drug combination analysis. 
It could be used in drug discovery and development projects, clinical practice, as well as basic biological understanding of the largely unexplored changes in cell-cell communication that occur due to disease and/or associated pharmacological treatment conditions.
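
The classical highest single agent (HSA) principle compares the effect of a combination with the effect of its best single constituent. The sketch below illustrates one plausible higher-order extension, in which a combination is compared with the best of all its measured lower-order subsets. This is an assumption made for illustration and not necessarily the generalization implemented in COMBSecretomics. The function name `hsa_excess`, the `effects` dictionary, and the convention that a larger value means a stronger desired effect are all hypothetical.

```python
from itertools import combinations


def hsa_excess(effects, combo):
    """Effect of `combo` minus the best effect among its measured proper subsets.

    `effects` maps frozensets of treatment names to a scalar effect
    (assumption: larger means a stronger desired effect). At least one
    proper, non-empty subset of `combo` must be present in `effects`.
    """
    combo = frozenset(combo)
    lower_order = [
        effects[frozenset(sub)]
        for size in range(1, len(combo))
        for sub in combinations(sorted(combo), size)
        if frozenset(sub) in effects
    ]
    return effects[combo] - max(lower_order)


# Made-up effect values for three hypothetical treatments A, B and C.
effects = {
    frozenset("A"): 0.2,
    frozenset("B"): 0.3,
    frozenset("C"): 0.1,
    frozenset("AB"): 0.5,
    frozenset("AC"): 0.25,
    frozenset("BC"): 0.35,
    frozenset("ABC"): 0.6,
}
```

Under these assumptions, a positive value of `hsa_excess(effects, "ABC")` would indicate that the triple outperforms every single agent and every pair that was measured.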

]]>
<![CDATA[Implementation of maternity protection legislation: Gynecologists’ perceptions and practices in French-speaking Switzerland]]> https://www.researchpad.co/article/elastic_article_11226 In several countries, maternity protection legislations (MPL) confer an essential role to gynecologist-obstetricians (OBGYNs) for the protection of pregnant workers and their future children from occupational exposures. This study explores OBGYNs’ practices and difficulties in implementing MPL in the French-speaking part of Switzerland.

Methods

An online survey was sent to 333 OBGYNs. Data analysis included: 1) descriptive and correlational statistics and 2) hierarchical cluster analysis to identify patterns of practices.

Results

OBGYNs evoked several problems in MPL implementation: absence of risk analysis in the companies, difficult collaboration with employers, lack of competencies in the field of occupational health. Preventive leave was underused, with sick leave being prescribed instead. Training had a positive effect on OBGYNs’ knowledge and implementation of MPL. Hierarchical cluster analysis highlighted three main types of practices: 1) practice in line with legislation; 2) practice on a case-by-case basis; 3) limited practice. OBGYNs with good knowledge of MPL more consistently applied its provisions.

Conclusion

The implementation of MPL appears challenging for OBGYNs. Collaboration with occupational physicians and training might help OBGYNs to better take on their role in maternity protection. MPL in itself could be improved.

]]>
<![CDATA[Determination of essential phenotypic elements of clusters in high-dimensional entities—DEPECHE]]> https://www.researchpad.co/article/5c8accc7d5eed0c48498ffa7

Technological advances have facilitated an exponential increase in the amount of information that can be derived from single cells, necessitating new computational tools that can make such highly complex data interpretable. Here, we introduce DEPECHE, a rapid, parameter-free, sparse k-means-based algorithm for clustering of multi- and megavariate single-cell data. In a number of computational benchmarks aimed at evaluating the capacity to form biologically relevant clusters, including flow/mass-cytometry and single-cell RNA sequencing data sets with manually curated gold standard solutions, DEPECHE clusters as well as or better than the best-performing clustering algorithms currently available. However, the main advantage of DEPECHE, compared to the state-of-the-art, is its unique ability to enhance interpretability of the formed clusters, in that it only retains variables relevant for cluster separation, thereby facilitating computationally efficient analyses as well as understanding of complex datasets. DEPECHE is implemented in the open source R package DepecheR, currently available at github.com/Theorell/DepecheR.

]]>
<![CDATA[Information integration in large brain networks]]> https://www.researchpad.co/article/5c65dcadd5eed0c484dec021

An outstanding problem in neuroscience is to understand how information is integrated across the many modules of the brain. While classic information-theoretic measures have transformed our understanding of feedforward information processing in the brain’s sensory periphery, comparable measures for information flow in the massively recurrent networks of the rest of the brain have been lacking. To address this, recent work in information theory has produced a sound measure of network-wide “integrated information”, which can be estimated from time-series data. However, a computational hurdle has stymied attempts to measure large-scale information integration in real brains. Specifically, the measurement of integrated information involves a combinatorial search for the informational “weakest link” of a network, a process whose computation time explodes super-exponentially with network size. Here, we show that spectral clustering, applied to the correlation matrix of time-series data, provides an approximate but robust solution to the search for the informational weakest link of large networks. This reduces the computation time for integrated information in large systems from longer than the lifespan of the universe to just minutes. We evaluate this solution in brain-like systems of coupled oscillators as well as in high-density electrocorticography data from two macaque monkeys, and show that the informational “weakest link” of the monkey cortex splits posterior sensory areas from anterior association areas. Finally, we use our solution to provide evidence in support of the long-standing hypothesis that information integration is maximized by networks with a high global efficiency, and that modular network structures promote the segregation of information.
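
The idea of using spectral clustering on a correlation matrix to propose a two-way split can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the affinity is taken to be the absolute Pearson correlation, the graph Laplacian is unnormalized, and the second-smallest eigenvector (the Fiedler vector) is obtained by shifted power iteration so that only the standard library is needed. The function names and the toy signals are hypothetical, and every series is assumed to be non-constant.

```python
import math


def correlation_matrix(series):
    """Pearson correlation between every pair of equal-length, non-constant series."""
    n = len(series)
    means = [sum(s) / len(s) for s in series]
    centered = [[x - m for x in s] for s, m in zip(series, means)]
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]
    corr = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(centered[i], centered[j]))
            corr[i][j] = dot / (norms[i] * norms[j])
    return corr


def spectral_bipartition(corr, iterations=500):
    """Split nodes into two groups by the sign of the Laplacian's Fiedler vector."""
    n = len(corr)
    # Affinity: absolute correlation, no self-loops (an assumption of this sketch).
    w = [[abs(corr[i][j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in w]
    # 2 * max degree bounds the largest Laplacian eigenvalue (Gershgorin),
    # so shift*I - L is positive semidefinite and power iteration applies.
    shift = 2.0 * max(degree)
    # Start orthogonal to the constant vector, the trivial eigenvector of L.
    v = [i - (n - 1) / 2 for i in range(n)]
    for _ in range(iterations):
        nxt = [
            (shift - degree[i]) * v[i] + sum(w[i][j] * v[j] for j in range(n))
            for i in range(n)
        ]
        mean = sum(nxt) / n
        nxt = [x - mean for x in nxt]  # re-project away from the constant vector
        norm = math.sqrt(sum(x * x for x in nxt))
        v = [x / norm for x in nxt]
    return [0 if x < 0 else 1 for x in v]


# Toy data: two slowly rising signals and two alternating signals.
toy_series = [
    [1, 2, 3, 4, 5, 6],
    [2, 4, 6, 8, 10, 12],
    [1, -1, 1, -1, 1, -1],
    [3, -3, 3, -3, 3, -3],
]
toy_labels = spectral_bipartition(correlation_matrix(toy_series))
```

The returned labels give a candidate bipartition in time polynomial in the number of nodes, in contrast to an exhaustive search over all bipartitions. Whether the candidate is the true informational weakest link would still have to be checked with the integrated information measure itself.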

]]>
<![CDATA[Deterministic column subset selection for single-cell RNA-Seq]]> https://www.researchpad.co/article/5c64493fd5eed0c484c2f93e

Analysis of single-cell RNA sequencing (scRNA-Seq) data often involves filtering out uninteresting or poorly measured genes and dimensionality reduction to reduce noise and simplify data visualization. However, techniques such as principal components analysis (PCA) fail to preserve non-negativity and sparsity structures present in the original matrices, and the coordinates of projected cells are not easily interpretable. Commonly used thresholding methods to filter genes avoid those pitfalls, but ignore collinearity and covariance in the original matrix. We show that a deterministic column subset selection (DCSS) method possesses many of the favorable properties of common thresholding methods and PCA, while avoiding pitfalls from both. We derive new spectral bounds for DCSS. We apply DCSS to two measures of gene expression from two scRNA-Seq experiments with different clustering workflows, and compare to three thresholding methods. In each case study, the clusters based on the small subset of the complete gene expression profile selected by DCSS are similar to clusters produced from the full set. The resulting clusters are informative for cell type.

]]>
<![CDATA[Illusory face detection in pure noise images: The role of interindividual variability in fMRI activation patterns]]> https://www.researchpad.co/article/5c466591d5eed0c484519d0d

Illusory face detection tasks can be used to study the neural correlates of top-down influences on face perception. In a typical functional magnetic resonance imaging (fMRI) study design, subjects are presented with pure noise images, but are told that half of the stimuli contain a face. The illusory face perception network is assessed by comparing blood oxygenation level dependent (BOLD) responses to images in which a face has been detected against BOLD activity related to images in which no face has been detected. In the present study, we highlight the existence of strong interindividual differences of BOLD activation patterns associated with illusory face perception. In the core system of face perception, 4 of 9 subjects had highly significant (p<0.05, corrected for multiple comparisons) activity in the bilateral occipital face area (OFA) and fusiform face area (FFA). In contrast, 5 of 9 subjects did not show any activity in these regions, even at statistical thresholds as liberal as p = 0.05, uncorrected. At the group level, this variability is reflected by non-significant activity in all regions of the core system. We argue that these differences might be related to individual differences in task execution: only some participants really detected faces in the noise images, while the other subjects simply responded in the desired way. This has several implications for future studies on illusory face detection. First, future studies should not only analyze results at the group level, but also for single subjects. Second, subjects should be explicitly queried after the fMRI experiment about whether they really detected faces or not. Third, if possible, not only the overt response of the subject, but also additional parameters that might indicate the perception of a noise stimulus as face should be collected (e.g., behavioral classification images).

]]>
<![CDATA[Clustering algorithms: A comparative approach]]> https://www.researchpad.co/article/5c478c94d5eed0c484bd335e

Many real-world systems can be studied in terms of pattern recognition tasks, so that proper use (and understanding) of machine learning methods in practical applications becomes essential. While many classification methods have been proposed, there is no consensus on which methods are more suitable for a given dataset. As a consequence, it is important to comprehensively compare methods in many possible scenarios. In this context, we performed a systematic comparison of 9 well-known clustering methods available in the R language assuming normally distributed data. In order to account for the many possible variations of data, we considered artificial datasets with several tunable properties (number of classes, separation between classes, etc.). In addition, we also evaluated the sensitivity of the clustering methods with regard to their parameter configuration. The results revealed that, when considering the default configurations of the adopted methods, the spectral approach tended to present particularly good performance. We also found that the default configuration of the adopted implementations was not always accurate. In these cases, a simple approach based on random selection of parameter values proved to be a good alternative to improve the performance. All in all, the reported approach provides guidance for the choice of clustering algorithms.

]]>
<![CDATA[Critical evaluation of linear regression models for cell-subtype specific methylation signal from mixed blood cell DNA]]> https://www.researchpad.co/article/5c254557d5eed0c48442c570

Epigenome-wide association studies seek to identify DNA methylation sites associated with clinical outcomes. Difference in observed methylation between specific cell-subtypes is often of interest; however, available samples often comprise a mixture of cells. To date, cell-subtype estimates have been obtained from mixed-cell DNA data using linear regression models, but the accuracy of such estimates has not been critically assessed. We evaluated linear regression performance for cell-subtype specific methylation estimation using a 450K methylation array dataset of both mixed-cell and cell-subtype sorted samples from six healthy males. CpGs associated with each cell-subtype were first identified using t-tests between groups of cell-subtype sorted samples. Subsequent reduced panels of reliably accurate CpGs were identified from mixed-cell samples using an accuracy heuristic (D). Performance was assessed by comparing cell-subtype specific estimates from mixed cells with the corresponding cell-sorted means using the mean absolute error (MAE) and the coefficient of determination (R²). At the cell-subtype level, methylation levels at 3272 CpGs could be estimated to within a MAE of 5% of the expected value. The cell-subtypes with the highest accuracy were CD56+ NK (R² = 0.56) and CD8+ T (R² = 0.48), where 23% of sites were accurately estimated. Hierarchical clustering and pathway enrichment analysis confirmed the biological relevance of the panels. Our results suggest that linear regression for cell-subtype specific methylation estimation is accurate only for some cell-subtypes at a small fraction of cell-associated sites but may be applicable to EWASs of disease traits with a blood-based pathology. Although sample size was a limitation in this study, we suggest that alternative statistical methods will provide the greatest performance improvements.
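
The two performance measures named above can be computed as in the following minimal sketch. It assumes that the coefficient of determination is defined as one minus the ratio of residual to total sum of squares, with the cell-sorted means as reference; the study may have used a different definition (for example, the squared correlation). The function names and the numbers are illustrative only.

```python
def mean_absolute_error(estimates, references):
    """Average absolute difference between estimated and reference values."""
    return sum(abs(e - r) for e, r in zip(estimates, references)) / len(references)


def r_squared(estimates, references):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_ref = sum(references) / len(references)
    ss_res = sum((r - e) ** 2 for e, r in zip(estimates, references))
    ss_tot = sum((r - mean_ref) ** 2 for r in references)
    return 1 - ss_res / ss_tot


# Made-up methylation fractions at three CpGs: regression estimates from
# mixed-cell data versus means measured in sorted cells.
toy_estimates = [0.1, 0.5, 0.9]
toy_sorted_means = [0.2, 0.5, 0.8]
toy_mae = mean_absolute_error(toy_estimates, toy_sorted_means)
toy_r2 = r_squared(toy_estimates, toy_sorted_means)
```

On a 0 to 1 methylation scale, a MAE threshold of 5% corresponds to `toy_mae <= 0.05` if the threshold is read as an absolute difference, which is itself an assumption about the abstract's wording.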

]]>
<![CDATA[Distillation of the clinical algorithm improves prognosis by multi-task deep learning in high-risk Neuroblastoma]]> https://www.researchpad.co/article/5c141e9bd5eed0c484d27646

We introduce the CDRP (Concatenated Diagnostic-Relapse Prognostic) architecture for multi-task deep learning that incorporates a clinical algorithm, e.g., a risk stratification schema, to improve prognostic profiling. We present the first application to survival prediction in High-Risk (HR) Neuroblastoma from transcriptomics data, a task that studies from the MAQC consortium have shown to remain the hardest among multiple diagnostic and prognostic endpoints predictable from the same dataset. To obtain a more accurate risk stratification needed for appropriate treatment strategies, CDRP combines a first component (CDRP-A) synthesizing a diagnostic task and a second component (CDRP-N) dedicated to one or more prognostic tasks. The approach leverages the advent of semi-supervised deep learning structures that can flexibly integrate multimodal data or internally create multiple processing paths. CDRP-A is an autoencoder trained on gene expression on the HR/non-HR risk stratification by the Children’s Oncology Group, obtaining a 64-node representation in the bottleneck layer. CDRP-N is a multi-task classifier for two prognostic endpoints, i.e., Event-Free Survival (EFS) and Overall Survival (OS). CDRP-A provides the HR embedding input to the CDRP-N shared layer, from which two branches depart to model EFS and OS, respectively. To control for selection bias, CDRP is trained and evaluated using a Data Analysis Protocol (DAP) developed within the MAQC initiative. CDRP was applied on Illumina RNA-Seq of 498 Neuroblastoma patients (HR: 176) from the SEQC study (12,464 Entrez genes) and on Affymetrix Human Exon Array expression profiles (17,450 genes) of 247 primary diagnostic Neuroblastoma of the TARGET NBL cohort. On the SEQC HR patients, CDRP achieves a Matthews Correlation Coefficient (MCC) of 0.38 for EFS and 0.19 for OS in external validation, improving over published SEQC models. 
We show that a CDRP-N embedding is indeed parametrically associated with increasing severity and that the embedding can be used to better stratify patients’ survival.
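
For readers unfamiliar with the metric reported above, the Matthews Correlation Coefficient of a binary classifier can be computed from the four confusion-matrix counts as sketched below. The function name and the example counts are illustrative and are not taken from the study. Returning 0 when the denominator vanishes is a common convention that is assumed here.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from true/false positive and negative counts; ranges from -1 to 1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a whole row or column is empty
    return (tp * tn - fp * fn) / denom


# Made-up counts for a small validation set.
toy_mcc = matthews_corrcoef(tp=6, tn=3, fp=1, fn=2)
```

An MCC of 0 corresponds to chance-level prediction and 1 to perfect prediction, and the metric stays informative when the two classes are imbalanced.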

]]>
<![CDATA[powerTCR: A model-based approach to comparative analysis of the clone size distribution of the T cell receptor repertoire]]> https://www.researchpad.co/article/5c08418cd5eed0c484fc9f91

Sequencing of the T cell receptor (TCR) repertoire is a powerful tool for deeper study of immune response, but the unique structure of this type of data makes its meaningful quantification challenging. We introduce a new method, the Gamma-GPD spliced threshold model, to address this difficulty. This biologically interpretable model captures the distribution of the TCR repertoire, demonstrates stability across varying sequencing depths, and permits comparative analysis across any number of sampled individuals. We apply our method to several datasets and obtain insights regarding the differentiating features in the T cell receptor repertoire among sampled individuals across conditions. We have implemented our method in the open-source R package powerTCR.

]]>
<![CDATA[Aging-associated patterns in the expression of human endogenous retroviruses]]> https://www.researchpad.co/article/5c102904d5eed0c484248d1c

Human endogenous retroviruses (HERV) are relics of ancient retroviral infections in our genome. Most of them have lost their coding capacity, but proviral RNA or protein has been observed in several disease states (e.g. in inflammatory and autoimmune diseases and malignancies). However, their clinical significance as well as their mechanisms of action remain elusive. As human aging is associated with several biological characteristics of these diseases, we analyzed the aging-associated expression of the individual proviruses of two HERV families, HERV-K (91 proviruses) and HERV-W (213 proviruses), using genome-wide RNA-sequencing (RNA-seq). RNA was purified from blood cells derived from healthy young individuals (n = 7) and from nonagenarians (n = 7). The data indicated that in the case of HERV-K (HML-2), 33 proviruses had detectable expression, but in only 3 of those were the expression levels significantly different between the young and old individuals. In the HERV-W family, expression was observed in 45 loci, and in only one case was the young/old difference significant. However, applying hierarchical clustering to the HERV expression data resulted in the formation of two distinct clusters, one containing the young individuals and another the nonagenarians. This suggests that even though the aging-associated differences in the expression levels of individual proviruses are minor, there seems to be some underlying aging-related pattern. These data indicate that aging does not have a strong effect on the expression of individual HERV proviruses, but instead several proviruses are affected moderately, leading to age-dependent expression profiles.
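
How many small per-feature differences can still separate two groups under hierarchical clustering can be illustrated with the minimal sketch below. It assumes Euclidean distance and average linkage, which may differ from the settings used in the study, and the expression profiles are invented toy values. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative_clusters(profiles, n_clusters=2):
    """Average-linkage agglomerative clustering; returns lists of sample indices."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dists = [
                    euclidean(profiles[i], profiles[j])
                    for i in clusters[a]
                    for j in clusters[b]
                ]
                avg = sum(dists) / len(dists)
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy profiles over four features: samples 0-2 and 3-5 differ only
# slightly on each feature, but consistently across all of them.
toy_profiles = [
    [1.0, 1.0, 1.0, 1.0],
    [1.1, 0.9, 1.0, 1.1],
    [0.9, 1.1, 1.1, 1.0],
    [1.4, 1.4, 1.5, 1.4],
    [1.5, 1.3, 1.4, 1.5],
    [1.3, 1.5, 1.4, 1.4],
]
toy_groups = agglomerative_clusters(toy_profiles, n_clusters=2)
```

In this toy case the two recovered groups coincide with the two sets of samples, because the distance between profiles accumulates the modest differences over all features.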

]]>
<![CDATA[Discriminating severe seasonal allergic rhinitis. Results from a large nation-wide database]]> https://www.researchpad.co/article/5c084234d5eed0c484fcc2c0

Allergic rhinitis (AR) is a chronic disease affecting a large proportion of the population. To optimize treatment and disease management, it is crucial to detect patients suffering from severe forms. Several tools have been used to classify patients according to severity: standardized questionnaires, visual analogue scales (VAS) and cluster analysis. The aim of this study was to evaluate the best method to stratify patients suffering from seasonal AR and to propose cut-offs to identify severe forms of the disease. In a multicenter French study (PollinAir), patients suffering from seasonal AR were assessed by a physician who completed a 17-item questionnaire and answered a self-assessment VAS. Five methods were evaluated to stratify patients according to AR severity: k-means clustering, agglomerative hierarchical clustering, Allergic Rhinitis Physician Score (ARPhyS), total symptoms score (TSS-17), and VAS. Fisher linear discriminant analysis, quadratic discriminant analysis, and non-parametric kernel density estimation methods were used to evaluate misclassification of the patients, and cross-validation was used to assess the validity of each scale. A total of 28,109 patients were categorized into “mild”, “moderate”, and “severe” through the 5 different methods. The best discrimination was offered by the ARPhyS scale. With the ARPhyS scale, cut-offs at a score of 8–9 for mild to moderate and of 11–12 for moderate to severe symptoms were found. Score reliability was also acceptable (Cronbach’s α coefficient: 0.626) for the ARPhyS scale, and excellent for the TSS-17 (0.864).

The ARPhyS scale seems the best method to target patients with severe seasonal AR. In the present study, we highlighted optimal discrimination cut-offs. This tool could be implemented in daily practice to identify severe patients that need a specialized intervention.
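
The reliability coefficient quoted above, Cronbach’s α, can be computed from item-level responses as in the following minimal sketch. It uses the standard formula with sample variances; the function name and the response matrix are illustrative and are not data from the PollinAir study.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha; `responses` has one row per respondent, one column per item."""
    k = len(responses[0])
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Made-up scores of four respondents on a two-item scale.
toy_responses = [
    [1, 2],
    [2, 1],
    [3, 4],
    [4, 3],
]
toy_alpha = cronbach_alpha(toy_responses)
```

Values closer to 1 indicate that the items vary together, which is the sense in which 0.864 for the TSS-17 reflects higher internal consistency than 0.626 for the ARPhyS scale.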

]]>
<![CDATA[A new method for evaluating the impacts of semantic similarity measures on the annotation of gene sets]]> https://www.researchpad.co/article/5c06f055d5eed0c484c6d731

Motivation

The recent revolution in new sequencing technologies, as a part of the continuous process of adopting new innovative protocols, has strongly impacted the interpretation of relations between phenotype and genotype. Thus, understanding the resulting gene sets has become a bottleneck that needs to be addressed. Automatic methods have been proposed to facilitate the interpretation of gene sets. While statistical functional enrichment analyses are currently well known, they tend to focus on well-known genes and to ignore new information from less-studied genes. To address such issues, applying semantic similarity measures is logical if the knowledge source used to annotate the gene sets is hierarchically structured. In this work, we propose a new method for analyzing the impact of different semantic similarity measures on gene set annotations.

Results

We evaluated the impact of each measure by taking into consideration the following two features, which correspond to relevant criteria for a “good” synthetic gene set annotation: (i) the number of annotation terms has to be drastically reduced and the representative terms must be retained while annotating the gene set, and (ii) the number of genes described by the selected terms should be as large as possible. Thus, we analyzed nine semantic similarity measures to identify the best possible compromise between both features while maintaining a sufficient level of detail. Using Gene Ontology to annotate the gene sets, we obtained better results with node-based measures that use the terms’ characteristics than with measures based on edges that link the terms. The annotation of the gene sets achieved with the node-based measures did not exhibit major differences regardless of the characteristics of terms used.

]]>
<![CDATA[Consequences of impaired 1-MDa TIC complex assembly for the abundance and composition of chloroplast high-molecular mass protein complexes]]> https://www.researchpad.co/article/5c92b376d5eed0c4843a40e7

We report a systematic analysis of chloroplast high-molecular mass protein complexes using a combination of native gel electrophoresis and absolute protein quantification by MSE. With this experimental setup, we characterized the effect of the tic56-3 mutation in the 1-MDa inner envelope translocase (TIC) on the assembly of the chloroplast proteome. We show that the tic56-3 mutation results in a reduction of the 1-MDa TIC complex to approximately 10% of wildtype levels. Hierarchical clustering confirmed the association of malate dehydrogenase (MDH) with an envelope-associated FtsH/FtsHi complex and suggested the association of a glycine-rich protein with the 1-MDa TIC complex. Depletion of this complex leads to a reduction of chloroplast ATPase to approx. 75% of wildtype levels, while the abundance of the FtsH/FtsHi complex is increased to approx. 140% of wildtype. The accumulation of the major photosynthetic complexes is not affected by the mutation, suggesting that tic56-3 plants can sustain a functional photosynthetic machinery despite a significant reduction of the 1-MDa TIC complex. Together our analysis expands recent efforts to catalogue the native molecular masses of chloroplast proteins and provides information on the consequences of impaired accumulation of the 1-MDa TIC translocase for chloroplast proteome assembly.

]]>