ResearchPad - hierarchical-clustering https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks
<![CDATA[COMBSecretomics: A pragmatic methodological framework for higher-order drug combination analysis using secretomics]]> https://www.researchpad.co/article/elastic_article_14596

Multidrug treatments are increasingly used in the clinic to combat complex and co-occurring diseases. However, most drug combination discovery efforts today are mainly focused on anticancer therapy and rarely examine the potential of using more than two drugs simultaneously. Moreover, there is currently no reported methodology for performing second- and higher-order drug combination analysis of secretomic patterns, that is, the concentration profiles of proteins released by cells. Here, we introduce COMBSecretomics (https://github.com/EffieChantzi/COMBSecretomics.git), the first pragmatic methodological framework designed to search exhaustively for second- and higher-order mixtures of candidate treatments that can modify, or even reverse, malfunctioning secretomic patterns of human cells. This framework comes with two novel model-free combination analysis methods: a tailor-made generalization of the highest single agent principle and a data mining approach based on top-down hierarchical clustering. Quality control procedures to eliminate outliers and non-parametric statistics to quantify uncertainty in the results obtained are also included. COMBSecretomics is based on a standardized reproducible format and could be employed with any experimental platform that provides the required protein release data. Its practical use and functionality are demonstrated by means of a proof-of-principle pharmacological study related to cartilage degradation. COMBSecretomics is the first methodological framework reported to enable secretome-related second- and higher-order drug combination analysis. It could be used in drug discovery and development projects, clinical practice, as well as basic biological understanding of the largely unexplored changes in cell-cell communication that occur due to disease and/or associated pharmacological treatment conditions.
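To make the highest single agent (HSA) idea concrete, the following Python sketch scores a mixture by how far its effect exceeds that of its best single constituent. It is a generic reading of HSA under stated assumptions, not the tailor-made generalization implemented in COMBSecretomics: the function name `hsa_excess`, the drug labels, and all effect values are hypothetical.

```python
from itertools import combinations

def hsa_excess(effects, combo):
    """Effect of `combo` minus the best effect among its single agents.

    `effects` maps a frozenset of drug names to a measured effect
    (larger means better). A positive result means the mixture beats
    its best single constituent.
    """
    best_single = max(effects[frozenset([drug])] for drug in combo)
    return effects[frozenset(combo)] - best_single

# Hypothetical effects for three drugs and all of their mixtures.
effects = {
    frozenset("A"): 0.20, frozenset("B"): 0.35, frozenset("C"): 0.10,
    frozenset("AB"): 0.50, frozenset("AC"): 0.25, frozenset("BC"): 0.30,
    frozenset("ABC"): 0.60,
}

# Exhaustive search over second- and third-order mixtures.
drugs = ["A", "B", "C"]
scores = {
    combo: hsa_excess(effects, combo)
    for order in (2, 3)
    for combo in combinations(drugs, order)
}
```

A stricter variant could compare a higher-order mixture against its best lower-order sub-mixture instead of its best single drug; the abstract does not say which reference the framework uses.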

]]>
<![CDATA[Implementation of maternity protection legislation: Gynecologists’ perceptions and practices in French-speaking Switzerland]]> https://www.researchpad.co/article/elastic_article_11226

In several countries, maternity protection legislations (MPL) confer an essential role to gynecologist-obstetricians (OBGYNs) for the protection of pregnant workers and their future children from occupational exposures. This study explores OBGYNs’ practices and difficulties in implementing MPL in the French-speaking part of Switzerland.

Methods

An online survey was sent to 333 OBGYNs. Data analysis included: 1) descriptive and correlational statistics and 2) hierarchical cluster analysis to identify patterns of practices.

Results

OBGYNs evoked several problems in MPL implementation: absence of risk analysis in the companies, difficult collaboration with employers, lack of competencies in the field of occupational health. Preventive leave was underused, with sick leave being prescribed instead. Training had a positive effect on OBGYNs’ knowledge and implementation of MPL. Hierarchical cluster analysis highlighted three main types of practices: 1) practice in line with legislation; 2) practice on a case-by-case basis; 3) limited practice. OBGYNs with good knowledge of MPL more consistently applied its provisions.

Conclusion

The implementation of MPL appears challenging for OBGYNs. Collaboration with occupational physicians and training might help OBGYNs to better take on their role in maternity protection. MPL in itself could be improved.

]]>
<![CDATA[Illusory face detection in pure noise images: The role of interindividual variability in fMRI activation patterns]]> https://www.researchpad.co/article/5c466591d5eed0c484519d0d

Illusory face detection tasks can be used to study the neural correlates of top-down influences on face perception. In a typical functional magnetic resonance imaging (fMRI) study design, subjects are presented with pure noise images, but are told that half of the stimuli contain a face. The illusory face perception network is assessed by comparing blood oxygenation level dependent (BOLD) responses to images in which a face has been detected against BOLD activity related to images in which no face has been detected. In the present study, we highlight the existence of strong interindividual differences in the BOLD activation patterns associated with illusory face perception. In the core system of face perception, 4 of 9 subjects had highly significant (p<0.05, corrected for multiple comparisons) activity in the bilateral occipital face area (OFA) and fusiform face area (FFA). In contrast, 5 of 9 subjects did not show any activity in these regions, even at statistical thresholds as liberal as p = 0.05, uncorrected. At the group level, this variability is reflected by non-significant activity in all regions of the core system. We argue that these differences might be related to individual differences in task execution: only some participants really detected faces in the noise images, while the other subjects simply responded in the desired way. This has several implications for future studies on illusory face detection. First, future studies should analyze results not only at the group level, but also for single subjects. Second, subjects should be explicitly queried after the fMRI experiment about whether they really detected faces or not. Third, if possible, not only the overt response of the subject, but also additional parameters that might indicate the perception of a noise stimulus as a face should be collected (e.g., behavioral classification images).

]]>
<![CDATA[Clustering algorithms: A comparative approach]]> https://www.researchpad.co/article/5c478c94d5eed0c484bd335e

Many real-world systems can be studied in terms of pattern recognition tasks, so that proper use (and understanding) of machine learning methods in practical applications becomes essential. While many classification methods have been proposed, there is no consensus on which methods are more suitable for a given dataset. As a consequence, it is important to comprehensively compare methods in many possible scenarios. In this context, we performed a systematic comparison of 9 well-known clustering methods available in the R language assuming normally distributed data. In order to account for the many possible variations of data, we considered artificial datasets with several tunable properties (number of classes, separation between classes, etc.). In addition, we also evaluated the sensitivity of the clustering methods with regard to their parameter configuration. The results revealed that, when considering the default configurations of the adopted methods, the spectral approach tended to present particularly good performance. We also found that the default configuration of the adopted implementations was not always accurate. In these cases, a simple approach based on random selection of parameter values proved to be a good alternative to improve the performance. All in all, the reported approach provides guidance for the choice of clustering algorithms.
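The kind of experiment described above can be sketched in a few lines of Python. The example generates two normally distributed classes with a tunable separation, clusters them with a naive single-linkage agglomerative procedure, and scores the result with the Rand index. It is a toy illustration only: the study used R implementations of nine methods, and the function names, sample sizes, separations, and the choice of single linkage here are all assumptions.

```python
import math
import random
from itertools import combinations

def make_blobs(n_per_class, separation, seed):
    """Two 2-D unit-variance Gaussian classes whose means lie `separation` apart."""
    rng = random.Random(seed)
    points, labels = [], []
    for cls in (0, 1):
        for _ in range(n_per_class):
            points.append((rng.gauss(cls * separation, 1.0), rng.gauss(0.0, 1.0)))
            labels.append(cls)
    return points, labels

def agglomerative(points, k):
    """Naive single-linkage agglomerative clustering, merging down to k clusters."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        clusters[a].extend(clusters.pop(b))  # a < b, so index a stays valid

    assignment = [0] * len(points)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            assignment[i] = cluster_id
    return assignment

def rand_index(truth, pred):
    """Fraction of point pairs on which two labelings agree (same vs. different)."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)

# Vary one tunable property (class separation) and record the score.
results = {}
for separation in (1.0, 20.0):
    points, labels = make_blobs(20, separation, seed=0)
    results[separation] = rand_index(labels, agglomerative(points, 2))
```

With strongly overlapping classes the score is expected to drop, whereas widely separated classes should be recovered exactly; repeating the loop over more properties and methods gives the shape of a systematic comparison.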

]]>
<![CDATA[Critical evaluation of linear regression models for cell-subtype specific methylation signal from mixed blood cell DNA]]> https://www.researchpad.co/article/5c254557d5eed0c48442c570

Epigenome-wide association studies seek to identify DNA methylation sites associated with clinical outcomes. Differences in observed methylation between specific cell-subtypes are often of interest; however, available samples often comprise a mixture of cells. To date, cell-subtype estimates have been obtained from mixed-cell DNA data using linear regression models, but the accuracy of such estimates has not been critically assessed. We evaluated linear regression performance for cell-subtype specific methylation estimation using a 450K methylation array dataset of both mixed-cell and cell-subtype sorted samples from six healthy males. CpGs associated with each cell-subtype were first identified using t-tests between groups of cell-subtype sorted samples. Subsequent reduced panels of reliably accurate CpGs were identified from mixed-cell samples using an accuracy heuristic (D). Performance was assessed by comparing cell-subtype specific estimates from mixed-cells with the corresponding cell-sorted mean using the mean absolute error (MAE) and the Coefficient of Determination (R²). At the cell-subtype level, methylation levels at 3272 CpGs could be estimated to within a MAE of 5% of the expected value. The cell-subtypes with the highest accuracy were CD56+ NK (R² = 0.56) and CD8+T (R² = 0.48), where 23% of sites were accurately estimated. Hierarchical clustering and pathways enrichment analysis confirmed the biological relevance of the panels. Our results suggest that linear regression for cell-subtype specific methylation estimation is accurate only for some cell-subtypes at a small fraction of cell-associated sites but may be applicable to EWASs of disease traits with a blood-based pathology. Although sample size was a limitation in this study, we suggest that alternative statistical methods will provide the greatest performance improvements.
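The two performance measures named above can be computed as in the following Python sketch. The methylation values are made up for illustration, and the coefficient of determination is taken here as one minus the ratio of residual to total sum of squares; the abstract does not state which definition the authors used, so that choice is an assumption.

```python
def mean_absolute_error(expected, estimated):
    """Average absolute difference between expected and estimated values."""
    return sum(abs(e - p) for e, p in zip(expected, estimated)) / len(expected)

def r_squared(expected, estimated):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(expected) / len(expected)
    ss_res = sum((e - p) ** 2 for e, p in zip(expected, estimated))
    ss_tot = sum((e - mean) ** 2 for e in expected)
    return 1 - ss_res / ss_tot

# Hypothetical methylation fractions at four CpGs:
# cell-sorted means versus estimates derived from mixed-cell data.
sorted_mean = [0.10, 0.50, 0.90, 0.30]
mixed_estimate = [0.12, 0.45, 0.85, 0.36]

mae = mean_absolute_error(sorted_mean, mixed_estimate)
r2 = r_squared(sorted_mean, mixed_estimate)
within_5_percent = mae <= 0.05  # mirrors the "MAE of 5%" criterion
```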

]]>
<![CDATA[Distillation of the clinical algorithm improves prognosis by multi-task deep learning in high-risk Neuroblastoma]]> https://www.researchpad.co/article/5c141e9bd5eed0c484d27646

We introduce the CDRP (Concatenated Diagnostic-Relapse Prognostic) architecture for multi-task deep learning that incorporates a clinical algorithm, e.g., a risk stratification schema, to improve prognostic profiling. We present the first application to survival prediction in High-Risk (HR) Neuroblastoma from transcriptomics data, a task that studies from the MAQC consortium have shown to remain the hardest among multiple diagnostic and prognostic endpoints predictable from the same dataset. To obtain a more accurate risk stratification needed for appropriate treatment strategies, CDRP combines a first component (CDRP-A) synthesizing a diagnostic task and a second component (CDRP-N) dedicated to one or more prognostic tasks. The approach leverages the advent of semi-supervised deep learning structures that can flexibly integrate multimodal data or internally create multiple processing paths. CDRP-A is an autoencoder trained on gene expression on the HR/non-HR risk stratification by the Children’s Oncology Group, obtaining a 64-node representation in the bottleneck layer. CDRP-N is a multi-task classifier for two prognostic endpoints, i.e., Event-Free Survival (EFS) and Overall Survival (OS). CDRP-A provides the HR embedding input to the CDRP-N shared layer, from which two branches depart to model EFS and OS, respectively. To control for selection bias, CDRP is trained and evaluated using a Data Analysis Protocol (DAP) developed within the MAQC initiative. CDRP was applied on Illumina RNA-Seq of 498 Neuroblastoma patients (HR: 176) from the SEQC study (12,464 Entrez genes) and on Affymetrix Human Exon Array expression profiles (17,450 genes) of 247 primary diagnostic Neuroblastoma of the TARGET NBL cohort. On the SEQC HR patients, CDRP achieves a Matthews Correlation Coefficient (MCC) of 0.38 for EFS and of 0.19 for OS in external validation, improving over published SEQC models. We show that a CDRP-N embedding is indeed parametrically associated with increasing severity and that the embedding can be used to better stratify patients’ survival.
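For readers unfamiliar with the metric reported above, the Matthews Correlation Coefficient for a binary endpoint can be computed from the confusion matrix as in this Python sketch. The labels are invented for illustration and are unrelated to the SEQC or TARGET data; returning 0.0 when the denominator vanishes is a common convention adopted here as an assumption.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels (1 = event, 0 = no event)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0.0 when any confusion-matrix margin is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical outcomes for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
mcc = matthews_corrcoef(y_true, y_pred)
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, which is why values such as 0.38 and 0.19 indicate a modest but above-chance signal.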

]]>
<![CDATA[powerTCR: A model-based approach to comparative analysis of the clone size distribution of the T cell receptor repertoire]]> https://www.researchpad.co/article/5c08418cd5eed0c484fc9f91

Sequencing of the T cell receptor (TCR) repertoire is a powerful tool for deeper study of immune response, but the unique structure of this type of data makes its meaningful quantification challenging. We introduce a new method, the Gamma-GPD spliced threshold model, to address this difficulty. This biologically interpretable model captures the distribution of the TCR repertoire, demonstrates stability across varying sequencing depths, and permits comparative analysis across any number of sampled individuals. We apply our method to several datasets and obtain insights regarding the differentiating features in the T cell receptor repertoire among sampled individuals across conditions. We have implemented our method in the open-source R package powerTCR.

]]>
<![CDATA[Aging-associated patterns in the expression of human endogenous retroviruses]]> https://www.researchpad.co/article/5c102904d5eed0c484248d1c

Human endogenous retroviruses (HERV) are relics of ancient retroviral infections in our genome. Most of them have lost their coding capacity, but proviral RNA or protein has been observed in several disease states (e.g. in inflammatory and autoimmune diseases and malignancies). However, their clinical significance as well as their mechanisms of action remain elusive. As human aging is associated with several biological characteristics of these diseases, we analyzed the aging-associated expression of the individual proviruses of two HERV families, HERV-K (91 proviruses) and HERV-W (213 proviruses), using genome-wide RNA-sequencing (RNA-seq). RNA was purified from blood cells derived from healthy young individuals (n = 7) and from nonagenarians (n = 7). The data indicated that in the case of HERV-K (HML-2), 33 proviruses had detectable expression, but in only 3 of those were the expression levels significantly different between the young and old individuals. In the HERV-W family, expression was observed in 45 loci, and in only one case was the young/old difference significant. However, applying hierarchical clustering to the HERV expression data resulted in the formation of two distinct clusters, one containing the young individuals and the other the nonagenarians. This suggests that even though the aging-associated differences in the expression levels of individual proviruses are minor, there seems to be some underlying aging-related pattern. These data indicate that aging does not have a strong effect on the expression of individual HERV proviruses, but instead several proviruses are affected moderately, leading to age-dependent expression profiles.

]]>
<![CDATA[Discriminating severe seasonal allergic rhinitis. Results from a large nation-wide database]]> https://www.researchpad.co/article/5c084234d5eed0c484fcc2c0

Allergic rhinitis (AR) is a chronic disease affecting a large proportion of the population. To optimize treatment and disease management, it is crucial to detect patients suffering from severe forms. Several tools have been used to classify patients according to severity: standardized questionnaires, visual analogue scales (VAS) and cluster analysis. The aim of this study was to evaluate the best method to stratify patients suffering from seasonal AR and to propose cut-offs to identify severe forms of the disease. In a multicenter French study (PollinAir), patients suffering from seasonal AR were assessed by a physician, who completed a 17-item questionnaire, and answered a self-assessment VAS. Five methods were evaluated to stratify patients according to AR severity: k-means clustering, agglomerative hierarchical clustering, Allergic Rhinitis Physician Score (ARPhyS), total symptoms score (TSS-17), and VAS. Fisher linear and quadratic discriminant analysis and non-parametric kernel density estimation methods were used to evaluate misclassification of the patients, and cross-validation was used to assess the validity of each scale. 28,109 patients were categorized into “mild”, “moderate”, and “severe” by each of the 5 methods. The best discrimination was offered by the ARPhyS scale. With the ARPhyS scale, cut-offs at a score of 8–9 for mild to moderate and of 11–12 for moderate to severe symptoms were found. Score reliability was also acceptable (Cronbach’s α coefficient: 0.626) for the ARPhyS scale, and excellent for the TSS-17 (0.864).

The ARPhyS scale seems the best method to target patients with severe seasonal AR. In the present study, we highlighted optimal discrimination cut-offs. This tool could be implemented in daily practice to identify severe patients that need a specialized intervention.
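The reliability figures quoted in this abstract are Cronbach's α values. As a reminder of how that coefficient is obtained, the sketch below applies the standard formula, α = k/(k−1) · (1 − Σ item variances / variance of total scores), to a tiny made-up questionnaire. The scores and the helper name `cronbach_alpha` are hypothetical and unrelated to the PollinAir data.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for rows of respondents and columns of items."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # one tuple of scores per item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)

# Hypothetical scores: four respondents answering three items.
responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
]
alpha = cronbach_alpha(responses)
```

Values closer to 1 indicate that the items vary together, which is the sense in which 0.864 is read as stronger internal consistency than 0.626.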

]]>
<![CDATA[A new method for evaluating the impacts of semantic similarity measures on the annotation of gene sets]]> https://www.researchpad.co/article/5c06f055d5eed0c484c6d731

Motivation

The recent revolution in sequencing technologies, as part of the continuous process of adopting innovative protocols, has strongly impacted the interpretation of relations between phenotype and genotype. Thus, understanding the resulting gene sets has become a bottleneck that needs to be addressed. Automatic methods have been proposed to facilitate the interpretation of gene sets. While statistical functional enrichment analyses are currently well known, they tend to focus on well-known genes and to ignore new information from less-studied genes. To address such issues, applying semantic similarity measures is logical if the knowledge source used to annotate the gene sets is hierarchically structured. In this work, we propose a new method for analyzing the impact of different semantic similarity measures on gene set annotations.

Results

We evaluated the impact of each measure by taking into consideration the following two features, which correspond to relevant criteria for a “good” synthetic gene set annotation: (i) the number of annotation terms has to be drastically reduced and the representative terms must be retained while annotating the gene set, and (ii) the number of genes described by the selected terms should be as large as possible. Thus, we analyzed nine semantic similarity measures to identify the best possible compromise between both features while maintaining a sufficient level of detail. Using Gene Ontology to annotate the gene sets, we obtained better results with node-based measures that use the terms’ characteristics than with measures based on edges that link the terms. The annotation of the gene sets achieved with the node-based measures did not exhibit major differences regardless of the characteristics of terms used.
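The distinction between node-based and edge-based measures can be illustrated on a toy ontology. The Python sketch below contrasts a Resnik-style node-based score (information content of the most informative common ancestor) with a simple edge-based score (inverse of the shortest path through a common ancestor). The abstract does not name the nine measures studied, so these two are stand-ins chosen for illustration, and the ontology, term names, and annotation counts are entirely hypothetical.

```python
import math

# Hypothetical ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "signaling": ["root"],
    "lipid": ["metabolism"],
    "sugar": ["metabolism"],
}
# Hypothetical number of genes annotated directly to each term.
DIRECT_ANNOTATIONS = {"root": 0, "metabolism": 0, "signaling": 4, "lipid": 2, "sugar": 2}

def up_distances(term):
    """Edge count from `term` to each ancestor (the term itself is at 0)."""
    dist, frontier = {term: 0}, [term]
    while frontier:
        next_frontier = []
        for t in frontier:
            for parent in PARENTS[t]:
                if parent not in dist:
                    dist[parent] = dist[t] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return dist

def information_content(term):
    """log(total / annotations at `term` or below); assumes that count is > 0."""
    total = sum(DIRECT_ANNOTATIONS.values())
    freq = sum(n for t, n in DIRECT_ANNOTATIONS.items() if term in up_distances(t))
    return math.log(total / freq)

def resnik(a, b):
    """Node-based: information content of the most informative common ancestor."""
    common = up_distances(a).keys() & up_distances(b).keys()
    return max(information_content(c) for c in common)

def edge_similarity(a, b):
    """Edge-based: 1 / (1 + shortest path length via a common ancestor)."""
    da, db = up_distances(a), up_distances(b)
    shortest = min(da[c] + db[c] for c in da.keys() & db.keys())
    return 1 / (1 + shortest)
```

In this toy setting both measures rank `lipid` and `sugar` as more similar to each other than `lipid` and `signaling`, but the node-based score depends on annotation frequencies while the edge-based score depends only on the graph layout.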

]]>
<![CDATA[Consequences of impaired 1-MDa TIC complex assembly for the abundance and composition of chloroplast high-molecular mass protein complexes]]> https://www.researchpad.co/article/5c92b376d5eed0c4843a40e7

We report a systematic analysis of chloroplast high-molecular mass protein complexes using a combination of native gel electrophoresis and absolute protein quantification by MSE. With this experimental setup, we characterized the effect of the tic56-3 mutation in the 1-MDa inner envelope translocase (TIC) on the assembly of the chloroplast proteome. We show that the tic56-3 mutation results in a reduction of the 1-MDa TIC complex to approximately 10% of wildtype levels. Hierarchical clustering confirmed the association of malate dehydrogenase (MDH) with an envelope-associated FtsH/FtsHi complex and suggested the association of a glycine-rich protein with the 1-MDa TIC complex. Depletion of this complex leads to a reduction of chloroplast ATPase to approx. 75% of wildtype levels, while the abundance of the FtsH/FtsHi complex is increased to approx. 140% of wildtype. The accumulation of the major photosynthetic complexes is not affected by the mutation, suggesting that tic56-3 plants can sustain a functional photosynthetic machinery despite a significant reduction of the 1-MDa TIC complex. Together our analysis expands recent efforts to catalogue the native molecular masses of chloroplast proteins and provides information on the consequences of impaired accumulation of the 1-MDa TIC translocase for chloroplast proteome assembly.

]]>