ResearchPad - clustering-algorithms Default RSS Feed en-us © 2020 Newgen KnowledgeWorks

<![CDATA[The two types of society: Computationally revealing recurrent social formations and their evolutionary trajectories]]>

Comparative social science has a long history of attempts to classify societies and cultures in terms of shared characteristics. However, only recently has it become feasible to conduct quantitative analysis of large historical datasets to mathematically approach the study of social complexity and classify shared societal characteristics. Such methods have the potential to identify recurrent social formations in human societies and contribute to social evolutionary theory. However, in order to achieve this potential, repeated studies are needed to assess the robustness of results to changing methods and data sets. Using an improved derivative of the Seshat: Global History Databank, we perform a clustering analysis of 271 past societies from sampling points across the globe to study plausible categorizations inherent in the data. Analysis indicates that the best fit to Seshat data is five subclusters existing as part of two clearly delineated superclusters (that is, two broad “types” of society in terms of social-ecological configuration). Our results add weight to the idea that human societies form recurrent social formations by replicating previous studies with different methods and data. Our results also contribute nuance to previously established measures of social complexity, illustrate diverse trajectories of change, and shed further light on the finite bounds of human social diversity.

<![CDATA[A graph-based algorithm for RNA-seq data normalization]]>

The use of RNA-sequencing has garnered much attention in recent years for characterizing and understanding various biological systems. However, it remains a major challenge to gain insights from a large number of RNA-seq experiments collectively, due to the normalization problem. Normalization has been challenging due to an inherent circularity, requiring that RNA-seq data be normalized before any pattern of differential (or non-differential) expression can be ascertained; meanwhile, the prior knowledge of non-differential transcripts is crucial to the normalization process. Some methods have successfully overcome this problem by the assumption that most transcripts are not differentially expressed. However, when RNA-seq profiles become more abundant and heterogeneous, this assumption fails to hold, leading to erroneous normalization. We present a normalization procedure that does not rely on this assumption, nor prior knowledge about the reference transcripts. This algorithm is based on a graph constructed from intrinsic correlations among RNA-seq transcripts and seeks to identify a set of densely connected vertices as references. Application of this algorithm on our synthesized validation data showed that it could recover the reference transcripts with high precision, thus resulting in high-quality normalization. On a realistic data set from the ENCODE project, this algorithm gave good results and could finish in a reasonable time. These preliminary results imply that we may be able to break the long persisting circularity problem in RNA-seq normalization.

<![CDATA[Determination of essential phenotypic elements of clusters in high-dimensional entities—DEPECHE]]>

Technological advances have facilitated an exponential increase in the amount of information that can be derived from single cells, necessitating new computational tools that can make such highly complex data interpretable. Here, we introduce DEPECHE, a rapid, parameter-free, sparse k-means-based algorithm for clustering of multi- and megavariate single-cell data. In a number of computational benchmarks aimed at evaluating the capacity to form biologically relevant clusters, including flow/mass-cytometry and single-cell RNA sequencing data sets with manually curated gold standard solutions, DEPECHE clusters as well as or better than the best-performing clustering algorithms currently available. However, the main advantage of DEPECHE, compared to the state-of-the-art, is its unique ability to enhance interpretability of the formed clusters, in that it only retains variables relevant for cluster separation, thereby facilitating computationally efficient analyses as well as understanding of complex datasets. DEPECHE is implemented in the open source R package DepecheR currently available at

<![CDATA[Regional disparities in maternal and child health indicators: Cluster analysis of districts in Bangladesh]]>

Efforts to mitigate public health concerns are showing encouraging results over time, but disparities across geographic regions still exist within countries. Inadequate research on regional disparities in health indicators based on representative and comparable data makes it difficult to develop evidence-based health policies, planning and future studies in developing countries like Bangladesh. This study examined the disparities among districts on various maternal and child health indicators in Bangladesh. Cluster analysis, an unsupervised learning technique, was applied to a nationally representative dataset from the Multiple Indicator Cluster Survey (MICS), 2012–13. According to our results, Bangladesh is classified into two clusters based on different health indicators, with substantial variation in the number of districts per cluster for different sets of indicators, suggesting regional variation across the indicators. There is a need to differentially focus on community-level interventions aimed at increasing maternal and child health care utilization and improving the socioeconomic position of mothers, especially in disadvantaged regions. The cluster analysis approach is unique in terms of the use of health care metrics in a multivariate setup to study regional similarity and dissimilarity in the context of Bangladesh.

<![CDATA[The organization of leukotriene biosynthesis on the nuclear envelope revealed by single molecule localization microscopy and computational analyses]]>

The initial steps in the synthesis of leukotrienes are the translocation of 5-lipoxygenase (5-LO) to the nuclear envelope and its subsequent association with its scaffold protein 5-lipoxygenase-activating protein (FLAP). A major gap in our understanding of this process is the knowledge of how the organization of 5-LO and FLAP on the nuclear envelope regulates leukotriene synthesis. We combined single molecule localization microscopy with Clus-DoC cluster analysis, and also a novel unbiased cluster analysis to analyze changes in the relationships between 5-LO and FLAP in response to activation of RBL-2H3 cells to generate leukotriene C4. We identified the time-dependent reorganization of both 5-LO and FLAP into higher-order assemblies or clusters in response to cell activation via the IgE receptor. Clus-DoC analysis identified a subset of these clusters with a high degree of interaction between 5-LO and FLAP that specifically correlates with the time course of LTC4 synthesis, strongly suggesting their role in the initiation of leukotriene biosynthesis.

<![CDATA[Clustering algorithms: A comparative approach]]>

Many real-world systems can be studied in terms of pattern recognition tasks, so that proper use (and understanding) of machine learning methods in practical applications becomes essential. While many classification methods have been proposed, there is no consensus on which methods are more suitable for a given dataset. As a consequence, it is important to comprehensively compare methods in many possible scenarios. In this context, we performed a systematic comparison of 9 well-known clustering methods available in the R language assuming normally distributed data. In order to account for the many possible variations of data, we considered artificial datasets with several tunable properties (number of classes, separation between classes, etc). In addition, we also evaluated the sensitivity of the clustering methods with regard to their parameter configuration. The results revealed that, when considering the default configurations of the adopted methods, the spectral approach tended to present particularly good performance. We also found that the default configuration of the adopted implementations was not always accurate. In these cases, a simple approach based on random selection of parameter values proved to be a good alternative to improve the performance. All in all, the reported approach provides guidance for the choice of clustering algorithms.

<![CDATA[Drinking and driving relapse: Data from BAC and MMPI-2]]>

Road traffic injuries are the ninth leading cause of death across all age groups, globally (WHO, 2015). Many road traffic crashes are caused by Driving Under the Influence (DUI) of alcohol by persons who have previously had their license suspended for the same reason. The aim of this study was to identify specific risk factors and personality characteristics in repeat offenders. The sample comprised 260 subjects who were not repeat DUI offenders (DUI-NR), but had a single license suspension between 2010 and 2011; and 97 repeat offenders (DUI-R) who received at least two DUI convictions within a period of 5 years. At the time of their first driving license suspension, participants provided their blood alcohol content (BAC) and completed a valid MMPI-2 test. ANOVA and MANOVAs were performed to determine whether there were significant differences in BAC and MMPI-2 profiles between DUI-NR and DUI-R participants, and a logistic regression was run to identify whether BAC at the time of the first suspension and specific personality features could predict recidivism. A two-step cluster analysis was run to identify recidivist typologies. Results showed that, relative to DUI-NR participants, DUI-R participants had higher BAC at the time of their first conviction and more problematic MMPI-2 profiles, despite the presence of social desirability responding. The best predictors of recidivism were BAC and the scales of Lie (L), Correction (K), Psychopathic Deviate (4-Pd), Hypomania (9-Ma), and Low Self-Esteem (LSE). Two-step cluster analyses identified two recidivist profiles, according to 32 selected MMPI-2 validity, clinical, content, supplementary, and PSY-5 scales. Comparisons with previous research are discussed and ideas for further study are generated.

<![CDATA[Modular structure in fish co-occurrence networks: A comparison across spatial scales and grouping methodologies]]>

Network modules are used for diverse purposes, ranging from delineation of biogeographical provinces to the study of biotic interactions. We assess spatial scaling effects on modular structure, using a multi-step process to compare fish co-occurrence networks at three nested scales. We first detect modules with simulated annealing and use spatial clustering tests (interspecific distances among species’ range centroids) to determine if modules consist of species with broadly overlapping ranges; strong spatial clustering may reflect environmental filtering, while absence of spatial clustering may reflect positive interspecific relationships (commensalism or mutualism). We then use non-hierarchical, multivariate cluster analysis as an alternative method to identify fish subgroups, repeat spatial clustering tests for the multivariate clusters, and compare spatial clustering results among modules and clusters. Next, we compare species lists within modules and clusters, and estimate congruence as the proportion of species assigned to the same groups by the two methods. Finally, we use a well-documented nest associate complex (fishes that deposit eggs in the gravel nests of a common host) to assess whether strong within-group associations may, in fact, reflect positive interspecific relationships. At each scale, 2–4 network modules were detected but a consistent relationship between scale and the number of modules was not observed. Significant spatial clustering was detected at all scales for network modules and multivariate clusters but was less prevalent at smaller scales. Congruence between modules and clusters was always < 90% and generally decreased as the number of groups increased. At all scales, the entire nest associate complex was preserved within a single network module, but not within a single multivariate cluster.
Collectively, our results suggest that network modules are promising tools for studying positive interactions and that smaller scales may be preferable in this research.
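
The congruence measure described above (the proportion of species assigned to the same groups by the two methods) can be sketched as follows. The abstract does not say how module labels are matched to cluster labels, so this illustration assumes a best one-to-one matching found by trying every pairing, which is only practical for a handful of groups. The function name `congruence` and the dictionary inputs are hypothetical.

```python
from itertools import permutations

def congruence(modules, clusters):
    """Proportion of species placed in matching groups by two methods.

    modules, clusters: dicts mapping species name -> group label (strings).
    Assumption: groups are paired one-to-one so that agreement is maximal.
    """
    species = sorted(modules)
    a_labels = sorted(set(modules.values()))
    b_labels = sorted(set(clusters.values()))
    # Pad with None so every module can be paired (None means "unmatched").
    b_labels += [None] * max(0, len(a_labels) - len(b_labels))
    best = 0
    for perm in permutations(b_labels, len(a_labels)):
        mapping = dict(zip(a_labels, perm))
        agree = sum(mapping[modules[s]] == clusters[s] for s in species)
        best = max(best, agree)
    return best / len(species)
```

With only 2 to 4 groups per scale, as reported above, exhaustive matching stays cheap; a larger number of groups would call for an assignment solver instead.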

<![CDATA[An improved DBSCAN algorithm based on cell-like P systems with promoters and inhibitors]]>

The density-based spatial clustering of applications with noise (DBSCAN) algorithm can find clusters of arbitrary shape while removing noise points. Membrane computing is a novel research branch of bio-inspired computing, which seeks to discover new computational models and frameworks from biological cells. The resulting parallel and distributed computing models are usually called P systems. In this work, the DBSCAN algorithm is improved by using the parallel evolution mechanism and hierarchical membrane structure of cell-like P systems with promoters and inhibitors, where promoters and inhibitors are used to regulate the parallelism of object evolution. Experimental results show that the proposed algorithm performs well in big cluster analysis. The time complexity is improved to O(n), compared with O(n²) for conventional DBSCAN. The results give some hints for improving conventional algorithms by using the hierarchical framework and parallel evolution mechanism of membrane computing models.
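
For orientation, a minimal sketch of conventional DBSCAN is given here. It is not the P-system variant proposed in the abstract; it is the sequential baseline, written with a linear-scan neighbor search so that the quadratic cost is visible (one scan of all n points for each of the n points). The parameter names `eps` and `min_pts` follow common usage and are not taken from the source.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (linear scan, O(n))."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point of the current cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels
```

Each point triggers at most one neighbor scan, so the sketch performs n scans of n points, which is the O(n²) behavior the abstract compares against.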

<![CDATA[Discriminating severe seasonal allergic rhinitis. Results from a large nation-wide database]]>

Allergic rhinitis (AR) is a chronic disease affecting a large proportion of the population. To optimize treatment and disease management, it is crucial to detect patients suffering from severe forms. Several tools have been used to classify patients according to severity: standardized questionnaires, visual analogue scales (VAS) and cluster analysis. The aim of this study was to evaluate the best method to stratify patients suffering from seasonal AR and to propose cut-offs to identify severe forms of the disease. In a multicenter French study (PollinAir), patients suffering from seasonal AR were assessed by a physician who completed a 17-item questionnaire, and answered a self-assessment VAS. Five methods were evaluated to stratify patients according to AR severity: k-means clustering, agglomerative hierarchical clustering, Allergic Rhinitis Physician Score (ARPhyS), total symptoms score (TSS-17), and VAS. Fisher linear discriminant analysis, quadratic discriminant analysis, and non-parametric kernel density estimation methods were used to evaluate misclassification of the patients, and cross-validation was used to assess the validity of each scale. 28,109 patients were categorized into “mild”, “moderate”, and “severe” by the 5 different methods. The best discrimination was offered by the ARPhyS scale. With the ARPhyS scale, cut-offs at a score of 8–9 for mild to moderate and of 11–12 for moderate to severe symptoms were found. Score reliability was also acceptable (Cronbach’s α coefficient: 0.626) for the ARPhyS scale, and excellent for the TSS-17 (0.864).

The ARPhyS scale seems the best method to target patients with severe seasonal AR. In the present study, we highlighted optimal discrimination cut-offs. This tool could be implemented in daily practice to identify severe patients that need a specialized intervention.
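
The reliability figures quoted above are Cronbach’s α coefficients. As a reminder of how such a coefficient is computed, here is a minimal sketch using the standard formula α = k/(k−1) · (1 − Σ item variances / variance of total scores). The function name and the list-of-rows input layout are illustrative assumptions and have no connection to the PollinAir data.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    Sample variances (n - 1 denominator) are used throughout.
    """
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Items that move together inflate the variance of the total score relative to the summed item variances, which pushes α toward 1.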

<![CDATA[Physicochemical characteristics and high sensory acceptability in cappuccinos made with jackfruit seeds replacing cocoa powder]]>

Jackfruit seeds are an under-utilized waste product in many tropical countries. In this work, we demonstrate the potential of roasted jackfruit seeds to substitute for cocoa powder in cappuccino formulations. Two different flours were produced from a hard variety jackfruit by drying or fermenting the seeds prior to roasting. Next, formulations were prepared with 50%, 75%, and 100% substitution of cocoa powder with jackfruit seed flours, for a total of seven formulations including the control. The acceptance of cappuccinos by consumers (n = 126) and quantitative descriptive analysis (QDA®) were used to describe the preparations. Physicochemical properties were also evaluated. When 50% and 75% of the cocoa powder was replaced with dry jackfruit seed flour, there was no change in sensory acceptability or technological properties; however, it is possible to identify advantages to using dry jackfruit seed flour, including moisture reduction and high wettability, solubility and sensory acceptance of the chocolate aroma. The principal component analysis of QDA explained 90% of the variance; cluster analysis enabled the definition of four groups for six cappuccino preparations. In fact, dry jackfruit seed flour is an innovative cocoa powder substitute; it could be used in food preparations, consequently utilizing this tropical fruit waste by incorporating it as an ingredient in a common product of the human diet.

<![CDATA[Ethnicity estimation using family naming practices]]>

This paper examines the association between given and family names and self-ascribed ethnicity as classified by the 2011 Census of Population for England and Wales. Using Census data in an innovative way under the new Office for National Statistics (ONS) Secure Research Service (SRS; previously the ONS Virtual Microdata Laboratory, VML), we investigate how bearers of a full range of given and family names assigned themselves to 2011 Census categories, using a names classification tool previously described in this journal. Based on these results, we develop a follow-up ethnicity estimation tool and describe how the tool may be used to observe changing relations between naming practices and ethnic identities as a facet of social integration and cosmopolitanism in an increasingly diverse society.

<![CDATA[Classification of primary angle closure spectrum with hierarchical cluster analysis]]>


To classify subjects with primary angle closure into clusters based on features from anterior segment optical coherence tomography (ASOCT) imaging and to explore how these clusters correspond to disease subtypes, including primary angle closure suspect (PACS), primary angle closure glaucoma (PACG), acute primary angle closure (APAC) and fellow eyes of APAC, and to reveal the factors that become more predominant in each subtype of angle closure.


A cross-sectional study of 248 eyes of 198 subjects (88 PACS eyes, 53 PACG eyes, 54 APAC eyes and 53 fellow eyes of APAC) that underwent complete examination including gonioscopy, A-scan biometry, and ASOCT. An agglomerative hierarchical clustering method was used to classify eyes based on ASOCT parameters.


Statistical clustering analysis produced three clusters among which the anterior segment parameters were significantly different. Cluster 1 (43 eyes) had the smallest anterior chamber depth (ACD) and area, as well as the greatest lens vault (p<0.001 for all). Cluster 2 (113 eyes) had the thickest iris at 2000 microns (p = 0.048), the largest iris area (p<0.001), and the deepest ACD (p<0.001). Cluster 3 (92 eyes) was characterized by elements of both clusters 1 and 2 and a higher iris curvature (p<0.001). There was a statistically significant difference in the distribution of clusters among subtypes of angle closure eyes (p<0.001). Although the patterns of clusters were similar in PACS and PACG eyes, with the majority of the eyes classified into cluster 2 (55% and 62%, respectively), the highest proportion of APAC and fellow eyes were assigned to clusters 1 (44%) and 3 (51%), respectively.


Hierarchical cluster analysis identified three clusters with different features. Predominant anatomical components are different among subtypes of primary angle closure.
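
A minimal sketch of agglomerative hierarchical clustering, the family of methods used above, is given here. The abstract does not state the linkage criterion or the distance metric, so average linkage and Euclidean distance are assumptions made purely for illustration, and the input points stand in for hypothetical per-eye feature vectors.

```python
import math
from itertools import combinations

def average_linkage(a, b, points):
    """Mean pairwise Euclidean distance between two clusters of point indices."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))

def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns a list of clusters, each a list of indices into points.
    """
    clusters = [[i] for i in range(len(points))]  # start with singletons
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], points),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return clusters
```

Stopping at a fixed number of clusters is one design choice; cutting the merge tree at a distance threshold is an equally common alternative.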

<![CDATA[Unsupervised clustering of temporal patterns in high-dimensional neuronal ensembles using a novel dissimilarity measure]]>

Temporally ordered multi-neuron patterns likely encode information in the brain. We introduce an unsupervised method, SPOTDisClust (Spike Pattern Optimal Transport Dissimilarity Clustering), for their detection from high-dimensional neural ensembles. SPOTDisClust measures similarity between two ensemble spike patterns by determining the minimum transport cost of transforming their corresponding normalized cross-correlation matrices into each other (SPOTDis). Then, it performs density-based clustering based on the resulting inter-pattern dissimilarity matrix. SPOTDisClust does not require binning and can detect complex patterns (beyond sequential activation) even when high levels of out-of-pattern “noise” spiking are present. Our method handles efficiently the additional information from increasingly large neuronal ensembles and can detect a number of patterns that far exceeds the number of recorded neurons. In an application to neural ensemble data from macaque monkey V1 cortex, SPOTDisClust can identify different moving stimulus directions on the sole basis of temporal spiking patterns.

<![CDATA[Robust auto-weighted multi-view subspace clustering with common subspace representation matrix]]>

In many computer vision and machine learning applications, the data are distributed on certain low-dimensional subspaces. Subspace clustering is a powerful technique for finding the underlying subspaces and clustering data points correctly. However, traditional subspace clustering methods can only be applied to data from one source, and how to extend these methods so that they combine information from various data sources has become a hot area of research. Previous multi-view subspace methods aim to learn multiple subspace representation matrices simultaneously, and the learning tasks for different views are treated equally. After obtaining representation matrices, they stack up the learned representation matrices as the common underlying subspace structure. However, for many problems, the importance of sources and the importance of features in one source can both vary, which makes the previous approaches ineffective. In this paper, we propose a novel method called Robust Auto-weighted Multi-view Subspace Clustering (RAMSC). In our method, the weights for both the sources and the features can be learned automatically by utilizing a novel trick and introducing a sparse norm. More importantly, the objective of our method is a common representation matrix which directly reflects the common underlying subspace structure. A new efficient algorithm is derived to solve the formulated objective, with a rigorous theoretical proof of its convergence. Extensive experimental results on five benchmark multi-view datasets demonstrate that the proposed method consistently outperforms the state-of-the-art methods.

<![CDATA[Utility and Limitations of Using Gene Expression Data to Identify Functional Associations]]>

Gene co-expression has been widely used to hypothesize gene function through guilt-by-association. However, it is not clear to what degree co-expression is informative, whether it can be applied to genes involved in different biological processes, and how the type of dataset impacts inferences about gene functions. Here our goal is to assess the utility and limitations of using co-expression as a criterion to recover functional associations between genes. By determining the percentage of gene pairs in a metabolic pathway with significant expression correlation, using Arabidopsis thaliana as an example, we found that many genes in the same pathway do not have similar transcript profiles and that the choice of dataset, annotation quality, gene function, expression similarity measure, and clustering approach significantly impact the ability to recover functional associations between genes. Some datasets are more informative in capturing coordinated expression profiles and larger data sets are not always better. In addition, to recover the maximum number of known pathways and identify candidate genes with similar functions, it is important to explore rather exhaustively multiple dataset combinations, similarity measures, clustering algorithms and parameters. Finally, we validated the biological relevance of co-expression cluster memberships with an independent phenomics dataset and found that genes that consistently cluster with leucine degradation genes tend to have similar leucine levels in mutants. This study provides a framework for obtaining gene functional associations by maximizing the information that can be obtained from gene expression datasets.

<![CDATA[Clustering cancer gene expression data by projective clustering ensemble]]>

Gene expression data analysis has paramount implications for gene treatments, cancer diagnosis and other domains. Clustering is an important and promising tool to analyze gene expression data. Gene expression data is often characterized by a large number of genes but limited samples; thus, various projective clustering techniques and ensemble techniques have been suggested to address these challenges. However, it is rather challenging to combine these two kinds of techniques to avoid the curse of dimensionality and to boost the performance of gene expression data clustering. In this paper, we employ a projective clustering ensemble (PCE) to integrate the advantages of projective clustering and ensemble clustering, and to avoid the dilemma of combining multiple projective clusterings. Our experimental results on publicly available cancer gene expression data show that PCE can improve the quality of clustering gene expression data by at least 4.5% (on average) over other related techniques, including dimensionality reduction based single clustering and ensemble approaches. The empirical study demonstrates that, to further boost the performance of clustering cancer gene expression data, it is necessary and promising to combine projective clustering with ensemble clustering. PCE can serve as an effective alternative technique for clustering gene expression data.

<![CDATA[What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm]]>

The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm, which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a-priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means, with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.
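
To make the baseline concrete, here is a minimal sketch of standard K-means (Lloyd's algorithm), not of MAP-DP. It illustrates two of the restrictions the abstract mentions: K is fixed in advance (here, by the number of initial centroids the caller supplies) and the data must be continuous vectors for which a mean is defined. The function signature is an illustrative assumption.

```python
import math

def kmeans(points, centroids, n_iter=100):
    """Lloyd's algorithm; K is fixed by the number of initial centroids.

    Returns (labels, centroids) after convergence or n_iter iterations.
    """
    centroids = [tuple(c) for c in centroids]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                mean = tuple(sum(x) / len(members) for x in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return labels, centroids
```

Nothing in this loop can add or remove a cluster, which is the limitation the abstract says MAP-DP addresses by estimating K from the data.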

<![CDATA[Unsupervised Cryo-EM Data Clustering through Adaptively Constrained K-Means Algorithm]]>

In single-particle cryo-electron microscopy (cryo-EM), the K-means clustering algorithm is widely used in unsupervised 2D classification of projection images of biological macromolecules. 3D ab initio reconstruction requires accurate unsupervised classification in order to separate molecular projections of distinct orientations. Due to background noise in single-particle images and uncertainty of molecular orientations, the traditional K-means clustering algorithm may classify images into wrong classes and produce classes with a large variation in membership. Overcoming these limitations requires further development of clustering algorithms for cryo-EM data analysis. We propose a novel unsupervised data clustering method building upon the traditional K-means algorithm. By introducing an adaptive constraint term in the objective function, our algorithm not only avoids a large variation in class sizes but also produces more accurate data clustering. Applications of this approach to both simulated and experimental cryo-EM data demonstrate that our algorithm is a significantly improved alternative to the traditional K-means algorithm in single-particle cryo-EM analysis.

<![CDATA[Electrosensory neural responses to natural electro-communication stimuli are distributed along a continuum]]>

Neural heterogeneities are seen ubiquitously within the brain and greatly complicate classification efforts. Here we tested whether the responses of an anatomically well-characterized sensory neuron population to natural stimuli could be used for functional classification. To do so, we recorded from pyramidal cells within the electrosensory lateral line lobe (ELL) of the weakly electric fish Apteronotus leptorhynchus in response to natural electro-communication stimuli, as these cells can be anatomically classified into six different types. We then used two independent methodologies to functionally classify responses: one relies on reducing the dimensionality of a feature space while the other directly compares the responses themselves. Both methodologies gave rise to qualitatively similar results: while ON and OFF-type cells could easily be distinguished from one another, ELL pyramidal neuron responses are actually distributed along a continuum rather than forming distinct clusters due to heterogeneities. We discuss the implications of our results for neural coding and highlight some potential advantages.