ResearchPad - preprocessing https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks

<![CDATA[OtoMatch: Content-based eardrum image retrieval using deep learning]]> https://www.researchpad.co/article/elastic_article_14747

Acute infections of the middle ear are the most commonly treated childhood diseases. Because complications affect children’s language learning and cognitive processes, it is essential to diagnose these diseases in a timely and accurate manner. The prevailing literature suggests that it is difficult to accurately diagnose these infections, even for experienced ear, nose, and throat (ENT) physicians. Advanced care practitioners (e.g., nurse practitioners, physician assistants) serve as first-line providers in many primary care settings and may benefit from additional guidance to appropriately determine the diagnosis and treatment of ear diseases. For this purpose, we designed a content-based image retrieval (CBIR) system (called OtoMatch) for normal, middle ear effusion, and tympanostomy tube conditions, operating on eardrum images captured with a digital otoscope. We present a method that enables the conversion of any convolutional neural network (trained for classification) into an image retrieval model. As a proof of concept, we converted a pre-trained deep learning model into an image retrieval system by changing the fully connected layers into lookup tables. A database of 454 labeled eardrum images (179 normal, 179 effusion, and 96 tube cases) was used to train and test the system. In 10-fold cross-validation, the proposed method achieved an average accuracy of 80.58% (SD 5.37%) and a maximum F1 score of 0.90 when retrieving the most similar image from the database. These are promising results for the first study to demonstrate the feasibility of developing a CBIR system for eardrum images using the newly proposed methodology.
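As an illustration of the classification-to-retrieval conversion described above, the sketch below stores unit-normalized features of all database images as a lookup table and returns the nearest entries for a query by cosine similarity. This is a minimal sketch, not the OtoMatch implementation; the feature extractor `extract_features` (e.g., the penultimate-layer activations of a trained CNN) is a hypothetical stand-in.

```python
# Minimal sketch of converting a classification CNN into a retrieval model.
# `extract_features` is a hypothetical stand-in for the penultimate-layer
# activations of a trained classification network.
import numpy as np

def build_lookup_table(images, labels, extract_features):
    """Store one unit-normalized feature vector per database image."""
    feats = np.stack([extract_features(img) for img in images])
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return feats, list(labels)

def retrieve(query_img, feats, labels, extract_features, k=1):
    """Return labels and similarities of the k most similar database images."""
    q = extract_features(query_img)
    q = q / np.linalg.norm(q)
    sims = feats @ q                      # cosine similarity against the lookup table
    top = np.argsort(-sims)[:k]
    return [(labels[i], float(sims[i])) for i in top]
```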

]]>
<![CDATA[Deep learning assisted detection of glaucomatous optic neuropathy and potential designs for a generalizable model]]> https://www.researchpad.co/article/elastic_article_14620

To evaluate ways to improve the generalizability of a deep learning algorithm for identifying glaucomatous optic neuropathy (GON) using a limited number of fundus photographs, as well as the key features being used for classification.

Methods

A total of 944 fundus images from Taipei Veterans General Hospital (TVGH) were retrospectively collected. Clinical and demographic characteristics, including structural and functional measurements of the images with GON, were recorded. Transfer learning based on VGGNet was used to construct a convolutional neural network (CNN) to identify GON. To avoid missing cases with advanced GON, an ensemble model was adopted in which a support vector machine classifier would make the final classification based on the cup-to-disc ratio if the CNN classifier had a low confidence score. The CNN classifier was first established using the TVGH dataset and then fine-tuned by combining the training images of the TVGH and Drishti-GS datasets. Class activation maps (CAMs) were used to identify the key features used for CNN classification. The performance of each classifier was determined through the area under the receiver operating characteristic curve (AUC) and compared with the ensemble model by diagnostic accuracy.

Results

In 187 TVGH test images, the accuracy, sensitivity, and specificity of the CNN classifier were 95.0%, 95.7%, and 94.2%, respectively, and the AUC was 0.992, compared to the 92.8% accuracy rate of the ensemble model. For the Drishti-GS test images, the accuracies of the CNN, the fine-tuned CNN, and the ensemble model were 33.3%, 80.3%, and 80.3%, respectively. The CNN classifier did not misclassify images with moderate to severe disease. Class-discriminative regions revealed by CAM co-localized with known characteristics of GON.

Conclusions

The ensemble model or a fine-tuned CNN classifier may be potential designs for building a generalizable deep learning model for glaucoma detection when large image databases are not available.

]]>
<![CDATA[pyKNEEr: An image analysis workflow for open and reproducible research on femoral knee cartilage]]> https://www.researchpad.co/article/N0686bd46-1746-4f66-8610-270f1b75b482

Transparent research in musculoskeletal imaging is fundamental to reliably investigate diseases such as knee osteoarthritis (OA), a chronic disease impairing femoral knee cartilage. To study cartilage degeneration, researchers have developed algorithms to segment femoral knee cartilage from magnetic resonance (MR) images and to measure cartilage morphology and relaxometry. The majority of these algorithms are not publicly available or require advanced programming skills to be compiled and run. However, to accelerate discoveries and findings, it is crucial to have open and reproducible workflows. We present pyKNEEr, a framework for open and reproducible research on femoral knee cartilage from MR images. pyKNEEr is written in Python, uses Jupyter notebooks as a user interface, and is available on GitHub with a GNU GPLv3 license. It is composed of three modules: 1) image preprocessing to standardize spatial and intensity characteristics; 2) femoral knee cartilage segmentation for intersubject, multimodal, and longitudinal acquisitions; and 3) analysis of cartilage morphology and relaxometry. Each module contains one or more Jupyter notebooks with narrative, code, visualizations, and dependencies to reproduce computational environments. pyKNEEr facilitates transparent image-based research of femoral knee cartilage because of its ease of installation and use, and its versatility for publication and sharing among researchers. Finally, due to its modular structure, pyKNEEr favors code extension and algorithm comparison. We tested our reproducible workflows with experiments that also constitute an example of transparent research with pyKNEEr, and we compared pyKNEEr's performance to existing algorithms using literature review visualizations. We provide links to executed notebooks and executable environments for immediate reproducibility of our findings.
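As a small illustration of what a preprocessing module does conceptually, the sketch below rescales the intensities of an MR volume to a fixed range with SimpleITK. This is not pyKNEEr's code, only an assumed example of intensity standardization; consult the pyKNEEr notebooks for the actual workflow.

```python
# Illustration only -- not pyKNEEr's code. A sketch of the kind of intensity
# standardization a preprocessing module may perform, using SimpleITK.
import SimpleITK as sitk

def standardize_intensity(in_path, out_path, new_min=0.0, new_max=100.0):
    img = sitk.ReadImage(in_path, sitk.sitkFloat32)
    rescaled = sitk.RescaleIntensity(img, new_min, new_max)  # linear rescale to [new_min, new_max]
    sitk.WriteImage(rescaled, out_path)
```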

]]>
<![CDATA[STRFs in primary auditory cortex emerge from masking-based statistics of natural sounds]]> https://www.researchpad.co/article/5c4a3057d5eed0c4844bfd7a

We investigate how neural processing in auditory cortex is shaped by the statistics of natural sounds. Hypothesising that auditory cortex (A1) represents the structural primitives out of which sounds are composed, we employ a statistical model to extract such components. The inputs to the model are cochleagrams, which approximate the non-linear transformations a sound undergoes from the outer ear, through the cochlea, to the auditory nerve. Cochleagram components do not superimpose linearly, but rather according to a rule which can be approximated using the max function. This is a consequence of the compression inherent in the cochleagram and the sparsity of natural sounds. Furthermore, cochleagrams do not have negative values. Cochleagrams are therefore not matched well by the assumptions of standard linear approaches such as sparse coding or ICA. We therefore consider a new encoding approach for natural sounds, which combines a model of early auditory processing with maximal causes analysis (MCA), a sparse coding model which captures both the non-linear combination rule and the non-negativity of the data. An efficient truncated EM algorithm is used to fit the MCA model to cochleagram data. We characterize the generative fields (GFs) inferred by MCA with respect to in vivo neural responses in A1 by applying reverse correlation to estimate the spectro-temporal receptive fields (STRFs) implied by the learned GFs. Despite the GFs being non-negative, the STRF estimates are found to contain both positive and negative subfields, where the negative subfields can be attributed to explaining-away effects as captured by the applied inference method. A direct comparison with ferret A1 shows many similar forms, and the spectral and temporal modulation tuning of both ferret and model STRFs shows similar ranges over the population. In summary, our model represents an alternative to linear approaches for biological auditory encoding, while capturing salient data properties and linking inhibitory subfields to explaining-away effects.
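The non-linear combination rule at the heart of the model can be illustrated with a toy example: sparse non-negative components combine through an element-wise max rather than a sum. The sketch below only contrasts the two rules; it is not the truncated EM implementation of MCA used in the paper.

```python
# Toy contrast between linear superposition and the max-combination rule
# assumed for cochleagram components (not the MCA/EM implementation).
import numpy as np

rng = np.random.default_rng(0)
H, D = 10, 64                    # number of components, patch dimensionality
W = rng.random((H, D))           # non-negative generative fields
s = (rng.random(H) < 0.2) * 1.0  # sparse binary causes

linear_mix = s @ W                        # standard linear superposition
max_mix = np.max(W * s[:, None], axis=0)  # max-combination used by MCA

print(linear_mix[:5], max_mix[:5])
```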

]]>
<![CDATA[Plant leaf tooth feature extraction]]> https://www.researchpad.co/article/5c6dc9b6d5eed0c48452a077

Leaf teeth can indicate several systematically informative features and are extremely useful for circumscribing fossil leaf taxa. Moreover, they can help discriminate species or even higher taxa accurately. Previous studies extract features that are not strictly defined in botany; therefore, a uniform standard to compare the accuracies of various feature extraction methods cannot be used. For efficient and automatic retrieval of plant leaves from a leaf database, in this study we propose an image-based description and measurement of leaf teeth by referring to the leaf structure classification system in botany. First, image preprocessing is carried out to obtain a binary map of the plant leaf. Then, corner detection based on the curvature scale-space (CSS) algorithm is used to extract inflection points from the edges; next, the leaf tooth apices are extracted by screening the convex points; then, according to the definition of the leaf structure, the characteristics of the leaf teeth are described and measured in terms of number of orders of teeth, tooth spacing, number of teeth, sinus shape, and tooth shape. In this manner, the data extracted by the algorithm can not only be used to classify plants, but also provide scientific and standardized data for understanding the history of plant evolution. Finally, to verify the effectiveness of the extraction method, we used simple linear discriminant analysis and a multiclass support vector machine to classify leaves. The results show that the proposed method achieves high accuracy that is superior to that of other methods.
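The corner-detection step can be sketched as follows: estimate the discrete curvature along the (smoothed) leaf contour and keep high-curvature convex points as candidate tooth apices. This is a simplified toy version of the idea, not the authors' CSS implementation, and the curvature sign convention depends on the contour orientation.

```python
# Toy curvature-based candidate detection on a closed leaf contour
# (simplified stand-in for CSS corner detection, not the paper's code).
import numpy as np

def curvature(contour, sigma=3):
    """Discrete curvature of a closed (N, 2) contour of (x, y) points."""
    x, y = contour[:, 0].astype(float), contour[:, 1].astype(float)
    k = np.ones(2 * sigma + 1) / (2 * sigma + 1)       # crude circular smoothing
    x = np.convolve(np.r_[x[-sigma:], x, x[:sigma]], k, mode="valid")
    y = np.convolve(np.r_[y[-sigma:], y, y[:sigma]], k, mode="valid")
    dx, dy = np.gradient(x), np.gradient(y)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    return (dx * ddy - dy * ddx) / np.power(dx**2 + dy**2, 1.5)

def tooth_apex_candidates(contour, thresh=0.05):
    """Indices of high-curvature convex points (sign depends on orientation)."""
    return np.where(curvature(contour) > thresh)[0]
```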

]]>
<![CDATA[RaCaT: An open source and easy to use radiomics calculator tool]]> https://www.researchpad.co/article/5c76fe64d5eed0c484e5b9d0

Purpose

The widely known field of radiomics aims to provide extensive image-based phenotyping of, e.g., tumors, using a wide variety of feature values extracted from medical images. Therefore, it is of utmost importance that feature values calculated by different institutes follow the same feature definitions. For this purpose, the imaging biomarker standardization initiative (IBSI) provides detailed mathematical feature descriptions, as well as (mathematical) test phantoms and corresponding reference feature values. We present here an easy-to-use radiomic feature calculator, RaCaT, which calculates a large number of radiomic features for all kinds of medical images, in compliance with the standard.

Methods

The calculator is implemented in C++ and comes as a standalone executable. Therefore, it can easily be integrated into any programming language and can also be called from the command line. No programming skills are required to use the calculator. The software architecture is highly modularized so that it is easily extensible. The user can also download the source code, adapt it if needed, and build the calculator from source. The calculated feature values are compliant with the ones provided by the IBSI standard. Source code, example files for the software configuration, and documentation can be found online on GitHub (https://github.com/ellipfaehlerUMCG/RaCat).
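Because the calculator ships as a standalone executable, it can be driven from any language via a system call. The snippet below shows the general pattern from Python; the executable name and the configuration argument are placeholders, not RaCaT's documented command-line interface (see the GitHub documentation for the actual parameters).

```python
# Generic pattern for calling a standalone executable from Python.
# The invocation below is a placeholder, NOT RaCaT's documented interface.
import subprocess

result = subprocess.run(
    ["./RaCaT", "patient_config.ini"],   # hypothetical executable path and config file
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```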

Results

The comparison with the standard values shows that all calculated features, as well as the image preprocessing steps, comply with the IBSI standard. The performance is also demonstrated on clinical examples.

Conclusions

The authors successfully implemented an easy-to-use radiomics calculator that can be called from any programming language or from the command line. Image preprocessing, feature settings, and calculations can be adjusted by the user.

]]>
<![CDATA[Automatic microarray image segmentation with clustering-based algorithms]]> https://www.researchpad.co/article/5c50c44bd5eed0c4845e8467

Image segmentation, as a key step of microarray image processing, is crucial for obtaining the spot expressions simultaneously. However, state-of-the-art clustering-based segmentation algorithms are sensitive to noise. To solve this problem and improve the segmentation accuracy, in this article, several improvements are introduced into the fast and simple clustering methods (K-means and fuzzy C-means). Firstly, a contrast enhancement algorithm is implemented in image preprocessing to improve the gridding precision. Secondly, data-driven means are proposed for cluster center initialization instead of the usual random setting. Thirdly, multiple features, including intensity, spatial, and shape features, are used in feature selection to replace the sole pixel-intensity feature of traditional clustering-based methods and thus avoid mistaking noise for spot pixels. Moreover, principal component analysis is adopted for feature extraction. Finally, an adaptive adjustment algorithm based on data mining and learning is proposed to further deal with missing or low-contrast spots. Experiments on real and simulated data sets indicate that the proposed improvements enable our method to obtain higher segmentation precision than the traditional K-means and fuzzy C-means clustering methods.
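The core clustering idea can be sketched as follows: describe every pixel of a gridded spot window by its intensity together with its spatial position and cluster the pixels into foreground and background. The sketch omits the paper's data-driven initialization, shape features, PCA, and adaptive adjustment, and in practice the features should be standardized before clustering.

```python
# Sketch of clustering-based spot segmentation with intensity + spatial features
# (general idea only; not the authors' full pipeline).
import numpy as np
from sklearn.cluster import KMeans

def segment_spot(window):
    """Segment one gridded spot window (2D intensity array) into spot/background."""
    h, w = window.shape
    yy, xx = np.mgrid[0:h, 0:w]
    feats = np.column_stack([
        window.ravel().astype(float),  # pixel intensity
        yy.ravel() / h,                # normalized row position
        xx.ravel() / w,                # normalized column position
    ])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
    # take the brighter cluster as the spot (foreground)
    fg = int(window.ravel()[labels == 1].mean() > window.ravel()[labels == 0].mean())
    return (labels == fg).reshape(h, w)
```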

]]>
<![CDATA[Efficient algorithms for Longest Common Subsequence of two bucket orders to speed up pairwise genetic map comparison]]> https://www.researchpad.co/article/5c2e7ff1d5eed0c48451c5d0

Genetic maps order genetic markers along chromosomes. They are, for instance, extensively used in marker-assisted selection to accelerate breeding programs. Even for the same species, people often have to deal with several alternative maps obtained using different ordering methods or different datasets, e.g. resulting from different segregating populations. Having efficient tools to identify the consistency and discrepancy of alternative maps is thus essential to facilitate genetic map comparisons. We propose to encode genetic maps as bucket orders, a kind of order that takes into account the blurred parts of the marker order while being an efficient data structure for low-complexity algorithms. The main result of this paper is an O(n log(n)) procedure to identify the largest agreement between two bucket orders of n elements, their Longest Common Subsequence (LCS), providing an efficient solution to highlight discrepancies between two genetic maps. The LCS of two maps, being the largest set of their collinear markers, is used as a building block to compute pairwise map congruence, to visually emphasize marker collinearity, and in some scaffolding methods that rely on genetic maps to improve genome assembly. As the LCS computation is a key subroutine of all these genetic-map-related tools, replacing their current LCS subroutine with ours, which does the exact same work but faster, could significantly speed up those methods without changing their accuracy. To ease such a transition, we provide all required algorithmic details in this self-contained paper, as well as an R package implementing them, named LCSLCIS, which is freely available at: https://github.com/holtzy/LCSLCIS.
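For intuition, the classical building block works as follows: when the two maps are strict orders (no ties), their LCS can be computed in O(n log(n)) by mapping the markers of one map to their positions in the other and finding the longest increasing subsequence. The sketch below shows this tie-free case only; the paper's contribution is the generalization to bucket orders, which this sketch does not handle.

```python
# Classical O(n log n) LCS of two permutations of the same marker set,
# via reduction to longest increasing subsequence (patience sorting).
# Bucket orders (ties) are NOT handled here.
from bisect import bisect_left

def lcs_of_permutations(map_a, map_b):
    pos_in_b = {marker: i for i, marker in enumerate(map_b)}
    seq = [pos_in_b[m] for m in map_a if m in pos_in_b]
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k + 1
    for v in seq:
        i = bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(tails)

# Two alternative orderings of five markers share a collinear subset of size 3.
print(lcs_of_permutations(["m1", "m2", "m3", "m4", "m5"],
                          ["m2", "m1", "m3", "m5", "m4"]))  # -> 3
```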

]]>
<![CDATA[A software tool for the quantification of metastatic colony growth dynamics and size distributions in vitro and in vivo]]> https://www.researchpad.co/article/5c2e7fd5d5eed0c48451b9a6

The majority of cancer-related deaths are due to metastasis, hence improved methods to biologically and computationally model metastasis are required. Computational models rely on robust data that is machine-readable. The current methods used to model metastasis in mice involve generating primary tumors by injecting human cells into immune-compromised mice, or examining genetically engineered mice that are pre-disposed to tumor development and that eventually metastasize. The degree of metastasis can be measured using flow cytometry, bioluminescence imaging, quantitative PCR, and/or by manually counting individual lesions from metastatic tissue sections. The aforementioned methods are time-consuming and do not provide information on the size distribution or spatial localization of individual metastatic lesions. In this work, we describe and provide a MATLAB script for an image-processing-based method designed to obtain quantitative data from tissue sections composed of multiple subpopulations of disseminated cells localized at metastatic sites in vivo. We further show that this method can be easily adapted for high-throughput imaging of live or fixed cells in vitro under a multitude of conditions in order to assess clonal fitness and evolution. The inherent variation in mouse studies and the increasing complexity of experimental designs that incorporate fate-mapping of individual cells result in the need for a large cohort of mice to generate a robust dataset. High-throughput imaging techniques such as the one we describe will enhance the data that can be used as input for the development of computational models aimed at modeling the metastatic process.
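The underlying image-processing pattern (threshold, clean up, label connected components, record per-lesion sizes) can be sketched in a few lines. The paper provides a MATLAB script; the Python sketch below is only an illustration of the idea, not a port of that script, and assumes lesions appear brighter than the background.

```python
# Generic lesion/colony counting sketch (illustrative; not the paper's MATLAB script).
# Assumes bright objects on a darker background.
from skimage import io, filters, measure, morphology

def count_lesions(image_path, min_area=50):
    img = io.imread(image_path, as_gray=True)
    mask = img > filters.threshold_otsu(img)             # global Otsu threshold
    mask = morphology.remove_small_objects(mask, min_area)
    labeled = measure.label(mask)
    areas = [region.area for region in measure.regionprops(labeled)]
    return len(areas), areas                              # lesion count and size distribution
```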

]]>
<![CDATA[Enhancing the prediction of acute kidney injury risk after percutaneous coronary intervention using machine learning techniques: A retrospective cohort study]]> https://www.researchpad.co/article/5c06f031d5eed0c484c6d333

Background

The current acute kidney injury (AKI) risk prediction model for patients undergoing percutaneous coronary intervention (PCI) from the American College of Cardiology (ACC) National Cardiovascular Data Registry (NCDR) employed regression techniques. This study aimed to evaluate whether models using machine learning techniques could significantly improve AKI risk prediction after PCI.

Methods and findings

We used the same cohort and candidate variables used to develop the current NCDR CathPCI Registry AKI model, including 947,091 patients who underwent PCI procedures between June 1, 2009, and June 30, 2011. The mean age of these patients was 64.8 years, and 32.8% were women, with a total of 69,826 (7.4%) AKI events. We replicated the current AKI model as the baseline model and compared it with a series of new models. Temporal validation was performed using data from 970,869 patients undergoing PCIs between July 1, 2016, and March 31, 2017, with a mean age of 65.7 years; 31.9% were women, and 72,954 (7.5%) had AKI events. Each model was derived by implementing one of two strategies for preprocessing candidate variables (preselecting and transforming candidate variables or using all candidate variables in their original forms), one of three variable-selection methods (stepwise backward selection, lasso regularization, or permutation-based selection), and one of two methods to model the relationship between variables and outcome (logistic regression or gradient descent boosting). The cohort was divided into different training (70%) and test (30%) sets using 100 different random splits, and the performance of the models was evaluated internally in the test sets. The best model, according to the internal evaluation, was derived by using all available candidate variables in their original form, permutation-based variable selection, and gradient descent boosting. Compared with the baseline model that uses 11 variables, the best model used 13 variables and achieved a significantly better area under the receiver operating characteristic curve (AUC) of 0.752 (95% confidence interval [CI] 0.749–0.754) versus 0.711 (95% CI 0.708–0.714), a significantly better Brier score of 0.0617 (95% CI 0.0615–0.0618) versus 0.0636 (95% CI 0.0634–0.0638), and a better calibration slope of observed versus predicted rate of 1.008 (95% CI 0.988–1.028) versus 1.036 (95% CI 1.015–1.056). The best model also had a significantly wider predictive range (25.3% versus 21.6%, p < 0.001) and was more accurate in stratifying AKI risk for patients. Evaluated on a more contemporary CathPCI cohort (July 1, 2015–March 31, 2017), the best model consistently achieved significantly better performance than the baseline model in AUC (0.785 versus 0.753), Brier score (0.0610 versus 0.0627), calibration slope (1.003 versus 1.062), and predictive range (29.4% versus 26.2%). The current study does not address implementation for risk calculation at the point of care, and potential challenges include the availability and accessibility of the predictors.
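The best-performing modeling strategy (all candidate variables in their original form, fed to a boosted-tree model, evaluated by AUC and Brier score on held-out data) follows a standard pattern that can be sketched as below. This is only an illustrative sketch with generic scikit-learn components, not the registry model, its variable set, or its permutation-based selection step.

```python
# Illustrative sketch of the boosting-and-evaluate pattern described above
# (generic scikit-learn components; not the NCDR CathPCI model).
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, brier_score_loss

def fit_and_evaluate(X, y, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)
    p = model.predict_proba(X_te)[:, 1]
    return roc_auc_score(y_te, p), brier_score_loss(y_te, p)
```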

Conclusions

Machine learning techniques and data-driven approaches resulted in improved prediction of AKI risk after PCI. The results support the potential of these techniques for improving risk prediction models and identification of patients who may benefit from risk-mitigation strategies.

]]>
<![CDATA[Scalable preprocessing of high volume environmental acoustic data for bioacoustic monitoring]]> https://www.researchpad.co/article/5b6dda12463d7e7491b405eb

In this work, we examine the problem of efficiently preprocessing and denoising high volume environmental acoustic data, which is a necessary step in many bird monitoring tasks. Preprocessing is typically made up of multiple steps which are considered separately from each other. These are often resource intensive, particularly because the volume of data involved is high. We focus on addressing two challenges within this problem: how to combine existing preprocessing tasks while maximising the effectiveness of each step, and how to process this pipeline quickly and efficiently, so that it can be used to process high volumes of acoustic data. We describe a distributed system designed specifically for this problem, utilising a master-slave model with data parallelisation. By investigating the impact of individual preprocessing tasks on each other, and their execution times, we determine an efficient and accurate order for preprocessing tasks within the distributed system. We find that, using a single core, our pipeline executes 1.40 times faster compared to manually executing all preprocessing tasks. We then apply our pipeline in the distributed system and evaluate its performance. We find that our system is capable of preprocessing bird acoustic recordings at a rate of 174.8 seconds of audio per second of real time with 32 cores over 8 virtual machines, which is 21.76 times faster than a serial process.
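As a simplified single-machine analogue of the master/worker idea, the per-recording preprocessing pipeline can be farmed out to a pool of worker processes. The sketch below only illustrates the parallelization pattern; the preprocessing body is a placeholder, and the actual system distributes work across multiple virtual machines.

```python
# Simplified single-machine analogue of the distributed master/worker design.
# `preprocess` is a placeholder for the combined denoising/preprocessing steps.
from multiprocessing import Pool

def preprocess(path):
    # ... apply the ordered preprocessing/denoising steps to one recording ...
    return path

def run_pipeline(recording_paths, workers=8):
    with Pool(processes=workers) as pool:
        return pool.map(preprocess, recording_paths)
```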

]]>
<![CDATA[Attraction Propagation: A User-Friendly Interactive Approach for Polyp Segmentation in Colonoscopy Images]]> https://www.researchpad.co/article/5989da25ab0ee8fa60b80677

This article presents a user-friendly interactive approach, Attraction Propagation (AP), for the segmentation of colorectal polyps. Compared with other interactive approaches, AP relies on only one foreground seed to capture polyps of different shapes, and it is compatible with the pre-processing stage of Computer-Aided Diagnosis (CAD) within the standard procedure of Optical Colonoscopy (OC). The experimental design was based on challenging, distinct datasets comprising a total of 1691 OC images, and the results demonstrated that, in both accuracy and computation speed, AP performed better than the state of the art.
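For context, a single-seed interactive segmentation can be sketched with plain region growing: starting from the one foreground seed, grow the region over neighboring pixels whose intensity stays close to the seed. This generic sketch is not the Attraction Propagation algorithm; it only illustrates the single-seed interaction model.

```python
# Generic single-seed region growing (NOT Attraction Propagation; illustration
# of segmenting from one foreground seed).
import numpy as np
from collections import deque

def region_grow(img, seed, tol=10):
    """Grow a region from `seed` (row, col) over pixels within `tol` of the seed intensity."""
    h, w = img.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_val = float(img[seed])
    q = deque([seed])
    mask[seed] = True
    while q:
        r, c = q.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc] \
                    and abs(float(img[nr, nc]) - seed_val) <= tol:
                mask[nr, nc] = True
                q.append((nr, nc))
    return mask
```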

]]>
<![CDATA[An investigation of emotion dynamics in major depressive disorder patients and healthy persons using sparse longitudinal networks]]> https://www.researchpad.co/article/5989db5cab0ee8fa60be018a

Background

Differences in within-person emotion dynamics may be an important source of heterogeneity in depression. To investigate these dynamics, researchers have previously combined multilevel regression analyses with network representations. However, sparse network methods, specifically developed for longitudinal network analyses, have not been applied. Therefore, this study used this approach to investigate population-level and individual-level emotion dynamics in healthy and depressed persons and compared this method with the multilevel approach.

Methods

Time-series data were collected in pair-matched healthy persons and major depressive disorder (MDD) patients (n = 54). Seven positive affect (PA) and seven negative affect (NA) items were administered electronically at 90 time points (30 days; three times per day). The population-level (healthy vs. MDD) and individual-level time series were analyzed using a sparse longitudinal network model based on vector autoregression. The population-level model was also estimated with a multilevel approach. The effects of different preprocessing steps were evaluated as well. The characteristics of the longitudinal networks were investigated to gain insight into the emotion dynamics.
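The general idea of a sparse longitudinal (lag-1) network can be sketched as node-wise lasso-regularized regression of each affect item at time t on all items at time t-1, yielding a sparse matrix of directed temporal associations. This is only a sketch of the general VAR(1) idea, not the specific estimator or preprocessing used in the study.

```python
# Sketch of a sparse lasso-regularized VAR(1) estimated node-wise
# (general idea only; not the study's estimator or preprocessing).
import numpy as np
from sklearn.linear_model import Lasso

def sparse_var1(ts, alpha=0.1):
    """ts: (T, p) array of affect scores over T time points.
    Returns a p x p matrix B with B[j, i] = weight of item i at t-1 on item j at t."""
    X, Y = ts[:-1], ts[1:]
    p = ts.shape[1]
    B = np.zeros((p, p))
    for j in range(p):
        B[j] = Lasso(alpha=alpha, max_iter=10000).fit(X, Y[:, j]).coef_
    return B
```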

Results

In the population-level networks, longitudinal network connectivity was strongest in the healthy group, with nodes showing more and stronger longitudinal associations with each other. Individually estimated networks varied strongly across individuals. Individual variations in network connectivity were unrelated to baseline characteristics (depression status, neuroticism, severity). A multilevel approach applied to the same data showed higher connectivity in the MDD group, which seemed partly related to the preprocessing approach.

Conclusions

The sparse network approach can be useful for the estimation of networks with multiple nodes, where overparameterization is an issue, and for individual-level networks. However, its current inability to model random effects makes it less useful as a population-level approach in case of large heterogeneity. Different preprocessing strategies appeared to strongly influence the results, complicating inferences about network density.

]]>
<![CDATA[Deformable registration of 3D ultrasound volumes using automatic landmark generation]]> https://www.researchpad.co/article/5c95523ed5eed0c4846f322e

Ultrasound (US) image registration is an important task, e.g., in computer-aided surgery. Due to tissue deformation occurring between pre-operative and interventional images, deformable registration is often necessary. We present a registration method focused on surface structures (i.e. saliencies) of soft tissues such as organ capsules or vessels. The main concept follows the idea of representative landmarks (so-called leading points). These landmarks represent saliencies in each image in a certain region of interest. The determination of deformation was based on a geometric model assuming that saliencies could locally be described by planes. These planes were calculated from the landmarks using two-dimensional linear regression. Once corresponding regions in both images were found, a displacement vector field representing the local deformation was computed. Finally, the deformed image was warped to match the pre-operative image. For error calculation we used a phantom representing the urinary bladder and the prostate. The phantom could be deformed to mimic tissue deformation. Error calculation was done using corresponding landmarks in both images. The resulting target registration error of this procedure amounted to 1.63 mm. With respect to patient data, a full deformable registration was performed on two 3D-US images of the abdomen. The resulting mean distance error was 2.10 ± 0.66 mm, compared to an error of 2.75 ± 0.57 mm from a simple rigid registration. A two-sided paired t-test showed a p-value < 0.001. We conclude that the method improves the results of rigid registration considerably. Provided an appropriate choice of the filter, there are many possible fields of application.
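The local plane estimation can be illustrated with a simple least-squares fit: given the leading points in a region of interest, fit z = a·x + b·y + c by two-dimensional linear regression. This sketch shows only that building block, not the full correspondence search, displacement-field computation, or warping.

```python
# Least-squares plane fit to landmark points in a region of interest
# (the plane-fitting building block only; not the full registration pipeline).
import numpy as np

def fit_plane(points):
    """points: (N, 3) array of landmark coordinates. Returns (a, b, c) of z = a*x + b*y + c."""
    A = np.column_stack([points[:, 0], points[:, 1], np.ones(len(points))])
    coeffs, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    return coeffs
```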

]]>
<![CDATA[Training in High-Throughput Sequencing: Common Guidelines to Enable Material Sharing, Dissemination, and Reusability]]> https://www.researchpad.co/article/5989da38ab0ee8fa60b86daa

The advancement of high-throughput sequencing (HTS) technologies and the rapid development of numerous analysis algorithms and pipelines in this field has resulted in an unprecedentedly high demand for training scientists in HTS data analysis. Embarking on developing new training materials is challenging for many reasons. Trainers often do not have prior experience in preparing or delivering such materials and struggle to keep them up to date. A repository of curated HTS training materials would support trainers in materials preparation, reduce the duplication of effort by increasing the usage of existing materials, and allow for the sharing of teaching experience among the HTS trainers’ community. To achieve this, we have developed a strategy for materials’ curation and dissemination. Standards for describing training materials have been proposed and applied to the curation of existing materials. A Git repository has been set up for sharing annotated materials that can now be reused, modified, or incorporated into new courses. This repository uses Git; hence, it is decentralized and self-managed by the community and can be forked/built-upon by all users. The repository is accessible at http://bioinformatics.upsc.se/htmr.

]]>
<![CDATA[Preconditioning 2D Integer Data for Fast Convex Hull Computations]]> https://www.researchpad.co/article/5989db37ab0ee8fa60bd37d5

In order to accelerate computing the convex hull on a set of n points, a heuristic procedure is often applied to reduce the number of points to a set of s points, s ≤ n, which also contains the same hull. We present an algorithm to precondition 2D data with integer coordinates bounded by a box of size p × q before building a 2D convex hull, with three distinct advantages. First, we prove that under the condition min(p, q) ≤ n the algorithm executes in time within O(n); second, no explicit sorting of data is required; and third, the reduced set of s points forms a simple polygonal chain and thus can be directly pipelined into an O(n) time convex hull algorithm. This paper empirically evaluates and quantifies the speedup gained by preconditioning a set of points with a method based on the proposed algorithm before using common convex hull algorithms to build the final hull. A speedup factor of at least four is consistently found in experiments on various datasets when the condition min(p, q) ≤ n holds; the smaller the ratio min(p, q)/n in the dataset, the greater the speedup factor achieved.
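One plausible reading of the column-wise preconditioning idea is sketched below: for every integer x column keep only the lowest and highest y point (only such extreme points can be hull vertices), then emit the kept points as a single chain. This sketch is an assumption about the flavor of the approach, not necessarily the paper's exact algorithm.

```python
# Sketch of column-wise preconditioning before a convex hull computation.
# A plausible reading of the idea, not necessarily the paper's exact algorithm.
def precondition(points, p):
    """points: iterable of integer (x, y) with 0 <= x < p. Runs in O(n + p) time."""
    lo = [None] * p   # lowest y seen in each x column
    hi = [None] * p   # highest y seen in each x column
    for x, y in points:
        if lo[x] is None or y < lo[x]:
            lo[x] = y
        if hi[x] is None or y > hi[x]:
            hi[x] = y
    # left-to-right along the upper extremes, then right-to-left along the lower ones
    chain = [(x, hi[x]) for x in range(p) if hi[x] is not None]
    chain += [(x, lo[x]) for x in reversed(range(p)) if lo[x] is not None and lo[x] != hi[x]]
    return chain
```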

]]>
<![CDATA[Identifying Topics in Microblogs Using Wikipedia]]> https://www.researchpad.co/article/5989da32ab0ee8fa60b84bd8

Twitter is an extremely high-volume platform for user-generated contributions regarding any topic. The wealth of content created in real time in massive quantities calls for automated approaches to identify the topics of the contributions. Such topics can be utilized in numerous ways, such as public opinion mining, marketing, entertainment, and disaster management. Towards this end, approaches to relate single or partial posts to knowledge base items have been proposed. However, in microblogging systems like Twitter, topics emerge from the culmination of a large number of contributions. Therefore, identifying topics based on collections of posts, where individual posts contribute to some aspect of the greater topic, is necessary. Models such as Latent Dirichlet Allocation (LDA) provide algorithms for relating collections of posts to sets of keywords that represent underlying topics. In these approaches, figuring out which specific topic(s) the keyword sets represent remains a separate task. Another issue in topic detection is the scope, which is often limited to a specific domain, such as health. This work proposes an approach for identifying domain-independent, specific topics related to sets of posts. In this approach, individual posts are processed and then aggregated to identify key tokens, which are then mapped to specific topics. Wikipedia article titles are selected to represent topics, since they are up-to-date, user-generated, sophisticated articles that span topics of human interest. This paper describes the proposed approach, a prototype implementation, and a case study based on data gathered during the periods of heavy contribution corresponding to the four US election debates in 2012. The manually evaluated results (0.96 precision) and other observations from the study are discussed in detail.
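The aggregate-then-map idea can be sketched as follows: pool tokens over a collection of posts, rank them by frequency, and keep the top tokens that match a Wikipedia article title. The tokenization and the exact-match title index below are deliberately simplistic placeholders, not the paper's pipeline.

```python
# Toy sketch of aggregating posts and mapping key tokens to Wikipedia titles
# (simplistic placeholders; not the paper's pipeline).
from collections import Counter
import re

def identify_topics(posts, wikipedia_titles, top_k=10):
    tokens = Counter()
    for post in posts:
        tokens.update(re.findall(r"[a-z']+", post.lower()))
    title_index = {t.lower(): t for t in wikipedia_titles}
    topics = []
    for token, _count in tokens.most_common():
        if token in title_index:
            topics.append(title_index[token])
        if len(topics) == top_k:
            break
    return topics
```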

]]>
<![CDATA[Spatio-Temporal Metabolite Profiling of the Barley Germination Process by MALDI MS Imaging]]> https://www.researchpad.co/article/5989d9faab0ee8fa60b71b0b

MALDI mass spectrometry imaging was performed to localize metabolites during the first seven days of barley germination. Up to 100 mass signals were detected, of which 85 were identified as 48 different metabolites with highly tissue-specific localizations. Oligosaccharides were observed in the endosperm and in parts of the developed embryo. Lipids in the endosperm co-localized depending on their fatty acid compositions, with changes in the distributions of diacyl phosphatidylcholines during germination. 26 potentially antifungal hordatines were detected in the embryo, with tissue-specific localizations of their glycosylated, hydroxylated, and O-methylated derivatives. In order to reveal spatio-temporal patterns in local metabolite compositions, multiple MSI data sets from a time series were analyzed in one batch. This requires a new preprocessing strategy to achieve comparability between data sets, as well as a new strategy for unsupervised clustering. The resulting spatial segmentation for each time-point sample is visualized in an interactive cluster map and enables simultaneous interactive exploration of all time points. Using this new analysis approach and visualization tool, germination-dependent developments of metabolite patterns were discovered with single-MS-position accuracy. This is the first study that presents metabolite profiling of a cereal's germination process over time by MALDI MSI, with the identification of a large number of peaks of agronomically and industrially important compounds such as oligosaccharides, lipids, and antifungal agents. Their detailed localization, as well as the MS cluster analyses for on-tissue metabolite profile mapping, revealed important information for the understanding of the germination process, which is of high scientific interest.

]]>
<![CDATA[A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data]]> https://www.researchpad.co/article/5989dab5ab0ee8fa60bac790

Anomaly detection is the process of identifying unexpected items or events in datasets which differ from the norm. In contrast to standard classification tasks, anomaly detection is often applied to unlabeled data, taking only the internal structure of the dataset into account. This challenge is known as unsupervised anomaly detection and is addressed in many practical applications, for example in network intrusion detection and fraud detection, as well as in the life science and medical domain. Dozens of algorithms have been proposed in this area, but unfortunately the research community still lacks a comparative universal evaluation as well as common, publicly available datasets. These shortcomings are addressed in this study, where 19 different unsupervised anomaly detection algorithms are evaluated on 10 different datasets from multiple application domains. By publishing the source code and the datasets, this paper aims to be a new, well-founded basis for unsupervised anomaly detection research. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time. Besides anomaly detection performance, the computational effort, the impact of parameter settings, and the global/local anomaly detection behavior are outlined. As a conclusion, we give advice on algorithm selection for typical real-world tasks.
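The evaluation pattern used in such comparisons can be sketched as follows: fit each unsupervised detector on the unlabeled data, score every point, and compare the rankings against held-back ground-truth anomaly labels via ROC AUC. The sketch below shows two scikit-learn detectors only and is not the study's 19-algorithm benchmark.

```python
# Sketch of comparing unsupervised detectors by ROC AUC against ground-truth
# labels that are used for evaluation only (not the study's full benchmark).
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

def evaluate_detectors(X, y_true):
    """y_true: 1 for anomalies, 0 for normal points; used only for scoring."""
    scores = {}
    iso = IsolationForest(random_state=0).fit(X)
    scores["IsolationForest"] = roc_auc_score(y_true, -iso.score_samples(X))
    lof = LocalOutlierFactor(n_neighbors=20).fit(X)
    scores["LOF"] = roc_auc_score(y_true, -lof.negative_outlier_factor_)
    return scores
```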

]]>
<![CDATA[SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data]]> https://www.researchpad.co/article/5989dae4ab0ee8fa60bbcc51

Next-generation sequencing (NGS) technologies have led to a huge amount of genomic data that need to be analyzed and interpreted. This fact has a major impact on the DNA sequence alignment process, which nowadays requires the mapping of billions of small DNA sequences onto a reference genome. As a result, sequence alignment remains the most time-consuming stage in the sequence analysis workflow. To deal with this issue, state-of-the-art aligners take advantage of parallelization strategies. However, the existing solutions show limited scalability and have a complex implementation. In this work we introduce SparkBWA, a new tool that exploits the capabilities of a big data technology such as Spark to boost the performance of one of the most widely adopted aligners, the Burrows-Wheeler Aligner (BWA). The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which assures its compatibility with any BWA version (future or legacy). SparkBWA is evaluated in different scenarios, showing notable results in terms of performance and scalability. A comparison with other parallel BWA-based aligners validates the benefits of our approach. Finally, an intuitive and flexible API is provided to NGS professionals in order to facilitate the acceptance and adoption of the new tool. The source code of the software described in this paper is publicly available at https://github.com/citiususc/SparkBWA, with a GPL3 license.

]]>