ResearchPad - Statistics and Probability https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks
<![CDATA[Testing Mean Differences among Groups: Multivariate and Repeated Measures Analysis with Minimal Assumptions]]> https://www.researchpad.co/product?articleinfo=5b598600463d7e76cf8ed8a6

ABSTRACT

To date, there is a lack of satisfactory inferential techniques for the analysis of multivariate data in factorial designs when only minimal assumptions on the data can be made. Presently available methods are limited to very particular study designs, assume either multivariate normality or equal covariance matrices across groups, or do not allow for an assessment of the interaction effects across within-subjects and between-subjects variables. We propose and methodologically validate a parametric bootstrap approach that does not suffer from any of the above limitations, and thus provides a rather general and comprehensive methodological route to inference for multivariate and repeated measures data. As an example application, we consider data from two different Alzheimer’s disease (AD) examination modalities that may be used for precise and early diagnosis, namely, single-photon emission computed tomography (SPECT) and electroencephalogram (EEG). These data violate the assumptions of classical multivariate methods, and indeed classical methods would not have yielded the same conclusions with regard to some of the factors involved.

]]>
<![CDATA[High-resolution reconstruction of the United States human population distribution, 1790 to 2010]]> https://www.researchpad.co/product?articleinfo=5bff4203d5eed0c484aa23ca

Where do people live, and how has this changed over timescales of centuries? High-resolution spatial information on historical human population distribution is of great significance for understanding human-environment interactions and their temporal dynamics. However, the complex relationship between population distribution and various influencing factors, coupled with limited data availability, makes it a challenge to reconstruct human population distribution over timescales of centuries. This study generated 1-km decadal population maps for the conterminous US from 1790 to 2010 using parsimonious models based on natural suitability, socioeconomic desirability, and inhabitability. Five models of increasing complexity were evaluated. The models were validated with census tract and county subdivision population data in 2000 and were applied to generate five sets of 22 historical population maps from 1790 to 2010. Separating urban and rural areas and excluding non-inhabitable areas were the most important factors for improving the overall accuracy. The generated gridded population datasets and the production and validation methods are described here.

]]>
<![CDATA[Wide-field corneal subbasal nerve plexus mosaics in age-controlled healthy and type 2 diabetes populations]]> https://www.researchpad.co/product?articleinfo=5bff4207d5eed0c484aa2455

A dense nerve plexus in the clear outer window of the eye, the cornea, can be imaged in vivo to enable non-invasive monitoring of peripheral nerve degeneration in diabetes. However, a limited field of view of corneal nerves, operator-dependent image quality, and subjective image sampling methods have led to difficulty in establishing robust diagnostic measures relating to the progression of diabetes and its complications. Here, we use machine-based algorithms to provide wide-area mosaics of the cornea’s subbasal nerve plexus (SBP) also accounting for depth (axial) fluctuation of the plexus. Degradation of the SBP with age has been mitigated as a confounding factor by providing a dataset comprising healthy and type 2 diabetes subjects of the same age. To maximize reuse, the dataset includes bilateral eye data, associated clinical parameters, and machine-generated SBP nerve density values obtained through automatic segmentation and nerve tracing algorithms. The dataset can be used to examine nerve degradation patterns to develop tools to non-invasively monitor diabetes progression while avoiding narrow-field imaging and image selection biases.

]]>
<![CDATA[A mobile brain-body imaging dataset recorded during treadmill walking with a brain-computer interface]]> https://www.researchpad.co/product?articleinfo=5bff4205d5eed0c484aa2410

We present a mobile brain-body imaging (MoBI) dataset acquired during treadmill walking in a brain-computer interface (BCI) task. The data were collected from eight healthy subjects, each having three identical trials. Each trial consisted of three conditions: standing, treadmill walking, and treadmill walking with a closed-loop BCI. During the BCI condition, subjects used their brain activity to control a virtual avatar on a screen to walk in real time. Robust procedures were designed to record lower limb joint angles (bilateral hip, knee, and ankle) using goniometers synchronized with 60-channel scalp electroencephalography (EEG). Additionally, electrooculogram (EOG), EEG electrode impedance, and digitized EEG channel locations were acquired to aid artifact removal and EEG dipole-source localization. This dataset is unique in that it is the first published MoBI dataset recorded during walking. It is useful in addressing several important open research questions, such as how EEG is coupled with the gait cycle during closed-loop BCI, how BCI influences neural activity during walking, and how a BCI decoder may be optimized.

]]>
<![CDATA[High-throughput density-functional perturbation theory phonons for inorganic materials]]> https://www.researchpad.co/product?articleinfo=5bff420dd5eed0c484aa2540

Knowledge of the vibrational properties of a material is of key importance for understanding physical phenomena such as thermal conductivity, superconductivity, and ferroelectricity, among others. However, detailed experimental phonon spectra are available only for a limited number of materials, which hinders the large-scale analysis of vibrational properties and their derived quantities. In this work, we perform ab initio calculations of the full phonon dispersion and vibrational density of states for 1521 semiconductor compounds in the harmonic approximation based on density functional perturbation theory. The data is collected along with derived dielectric and thermodynamic properties. We present the procedure used to obtain the results, the details of the provided database and a validation based on the comparison with experimental data.

]]>
<![CDATA[Novel sequences, structural variations and gene presence variations of Asian cultivated rice]]> https://www.researchpad.co/product?articleinfo=5bff420ed5eed0c484aa25aa

Genomic diversity within a species is the genetic basis of the phenotypic diversity essential for its adaptation to environments. The big picture of the total genetic diversity within Asian cultivated rice has been uncovered since the sequencing of 3,000 rice genomes, including the SNP data publicly available in the SNP-Seek database. Here we report other aspects of the genetic diversity, including rice sequences assembled from over 3,000 accessions but absent in the Nipponbare reference genome, structural variations (SVs) and gene presence/absence variations (PAVs) in 453 accessions with sequencing depth over 20x. Using either SVs or gene PAVs, we were able to reconstruct the population structure of O. sativa, which was consistent with previous results based on SNPs. Moreover, we demonstrated the usefulness of the new data sets by successfully detecting the strong association of the “Green Revolution gene”, sd1, with plant height. Our data provide a more comprehensive view of the genetic diversity within rice, as well as additional genomic resources for research in rice breeding and plant biology.

]]>
<![CDATA[DataTri, a database of American triatomine species occurrence]]> https://www.researchpad.co/product?articleinfo=5bff41fed5eed0c484aa22af

Trypanosoma cruzi, the causative agent of Chagas disease, is transmitted to mammals - including humans - by insect vectors of the subfamily Triatominae. We present the results of a compilation of triatomine occurrence and complementary ecological data that represents the most complete, integrated and updated database (DataTri) available on triatomine species at a continental scale. This database was assembled by collecting the records of triatomine species published from 1904 to 2017, spanning all American countries with triatomine presence. A total of 21815 georeferenced records were obtained from published literature, personal fieldwork and data provided by colleagues. The data compiled includes 24 American countries, 14 genera and 135 species. From a taxonomic perspective, 67.33% of the records correspond to the genus Triatoma, 20.81% to Panstrongylus, 9.01% to Rhodnius and the remaining 2.85% are distributed among the other 11 triatomine genera. We encourage using DataTri information in various areas, especially to improve knowledge of the geographical distribution of triatomine species and its variations in time.

]]>
<![CDATA[A Mediterranean coastal database for assessing the impacts of sea-level rise and associated hazards]]> https://www.researchpad.co/product?articleinfo=5b4cf84b463d7e12d26b018a

We have developed a new coastal database for the Mediterranean basin that is intended for coastal impact and adaptation assessment to sea-level rise and associated hazards on a regional scale. The data structure of the database relies on a linear representation of the coast with associated spatial assessment units. Using information on coastal morphology, human settlements and administrative boundaries, we have divided the Mediterranean coast into 13 900 coastal assessment units. To these units we have spatially attributed 160 parameters on the characteristics of the natural and socio-economic subsystems, such as extreme sea levels, vertical land movement and number of people exposed to sea-level rise and extreme sea levels. The database contains information on current conditions and on plausible future changes that are essential drivers for future impacts, such as sea-level rise rates and socio-economic development. Besides its intended use in risk and impact assessment, we anticipate that the Mediterranean Coastal Database (MCD) constitutes a useful source of information for a wide range of coastal applications.

]]>
<![CDATA[The Hamming Ball Sampler]]> https://www.researchpad.co/product?articleinfo=5bf6be03d5eed0c484d2b813

ABSTRACT

We introduce the Hamming ball sampler, a novel Markov chain Monte Carlo algorithm, for efficient inference in statistical models involving high-dimensional discrete state spaces. The sampling scheme uses an auxiliary variable construction that adaptively truncates the model space allowing iterative exploration of the full model space. The approach generalizes conventional Gibbs sampling schemes for discrete spaces and provides an intuitive means for user-controlled balance between statistical efficiency and computational tractability. We illustrate the generic utility of our sampling algorithm through application to a range of statistical models. Supplementary materials for this article are available online.
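The auxiliary-variable idea described above can be sketched for binary vectors as follows. This is a minimal illustration, not the authors' implementation: it assumes the auxiliary variable is drawn uniformly from the Hamming ball around the current state, and the names `hamming_ball`, `hamming_ball_step`, and the toy `log_target` are hypothetical.

```python
import itertools
import math
import random


def hamming_ball(center, radius):
    """Return all binary tuples within Hamming distance `radius` of `center`."""
    n = len(center)
    ball = []
    for r in range(radius + 1):
        for flip in itertools.combinations(range(n), r):
            x = list(center)
            for i in flip:
                x[i] = 1 - x[i]
            ball.append(tuple(x))
    return ball


def hamming_ball_step(x, log_target, radius, rng):
    """One sweep: draw an auxiliary state u, then resample x inside the ball around u."""
    # Step 1 (assumption): u is uniform over the Hamming ball centred at x.
    u = rng.choice(hamming_ball(x, radius))
    # Step 2: sample the new state from the target restricted to the ball around u.
    candidates = hamming_ball(u, radius)
    log_w = [log_target(c) for c in candidates]
    shift = max(log_w)  # subtract the maximum for numerical stability
    weights = [math.exp(lw - shift) for lw in log_w]
    return rng.choices(candidates, weights=weights, k=1)[0]


def run_chain(x0, log_target, radius, n_steps, seed=0):
    """Run the sampler for `n_steps` sweeps and return the visited states."""
    rng = random.Random(seed)
    x, states = tuple(x0), []
    for _ in range(n_steps):
        x = hamming_ball_step(x, log_target, radius, rng)
        states.append(x)
    return states


if __name__ == "__main__":
    # Toy target (hypothetical): independent bits with log-odds 1 each.
    chain = run_chain((0, 0, 0, 0), lambda x: float(sum(x)), radius=1, n_steps=5000)
    print(sum(s[0] for s in chain) / len(chain))
```

The radius is the user-controlled knob: a larger ball lets each sweep move further but requires evaluating the target at more candidate states.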

]]>
<![CDATA[YASARA View—molecular graphics for all devices—from smartphones to workstations]]> https://www.researchpad.co/product?articleinfo=5ba773f340307c2b1f5d481a

Summary: Today's graphics processing units (GPUs) compose the scene from individual triangles. As about 320 triangles are needed to approximate a single sphere—an atom—in a convincing way, visualizing larger proteins with atomic details requires tens of millions of triangles, far too many for smooth interactive frame rates. We describe a new approach to solve this ‘molecular graphics problem’, which shares the work between GPU and multiple CPU cores, generates high-quality results with perfectly round spheres, shadows and ambient lighting and requires only OpenGL 1.0 functionality, without any pixel shader Z-buffer access (a feature which is missing in most mobile devices).
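The triangle budget mentioned above is simple arithmetic, sketched here for concreteness. The figure of 320 triangles per sphere comes from the summary; the atom count of 100,000 is a hypothetical example, and the function names are illustrative only.

```python
TRIANGLES_PER_SPHERE = 320  # figure quoted in the summary


def triangles_needed(n_atoms):
    """Triangles required to draw `n_atoms` spheres with a triangle mesh."""
    return n_atoms * TRIANGLES_PER_SPHERE


def atoms_within_budget(triangle_budget):
    """Largest number of atoms that fits into a given triangle budget."""
    return triangle_budget // TRIANGLES_PER_SPHERE


if __name__ == "__main__":
    # Hypothetical structure with 100,000 atoms.
    print(triangles_needed(100_000))        # 32000000
    print(atoms_within_budget(10_000_000))  # 31250
```

Under these assumptions, a structure of a few tens of thousands of atoms already reaches the "tens of millions of triangles" regime.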

Availability and implementation: YASARA View, a molecular modeling program built around the visualization algorithm described here, is freely available (including commercial use) for Linux, MacOS, Windows and Android (Intel) from www.YASARA.org.

Contact: elmar@yasara.org

Supplementary information: Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[ccSOL omics: a webserver for solubility prediction of endogenous and heterologous expression in Escherichia coli]]> https://www.researchpad.co/product?articleinfo=5ba773f140307c2b1f5d4819

Summary: Here we introduce ccSOL omics, a webserver for large-scale calculations of protein solubility. Our method allows (i) proteome-wide predictions; (ii) identification of soluble fragments within each sequence; and (iii) exhaustive single-point mutation analysis.

Results: Using coil/disorder, hydrophobicity, hydrophilicity, β-sheet and α-helix propensities, we built a predictor of protein solubility. Our approach shows an accuracy of 79% on the training set (36 990 Target Track entries). Validation on three independent sets indicates that ccSOL omics discriminates soluble and insoluble proteins with an accuracy of 74% on 31 760 proteins sharing <30% sequence similarity.

Availability and implementation: ccSOL omics can be freely accessed on the web at http://s.tartaglialab.com/page/ccsol_group. Documentation and tutorial are available at http://s.tartaglialab.com/static_files/shared/tutorial_ccsol_omics.html.

Contact: gian.tartaglia@crg.es

Supplementary information: Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[GATB: Genome Assembly & Analysis Tool Box]]> https://www.researchpad.co/product?articleinfo=5ba773ea40307c2b1f5d4816

Motivation: Efficient and fast next-generation sequencing (NGS) algorithms are essential to analyze the terabytes of data generated by the NGS machines. A serious bottleneck can be the design of such algorithms, as they require sophisticated data structures and advanced hardware implementation.

Results: We propose an open-source library dedicated to genome assembly and analysis to speed up the development of efficient software. The library is based on a recent optimized de Bruijn graph implementation allowing complex genomes to be processed on desktop computers using fast algorithms with low memory footprints.

Availability and implementation: The GATB library is written in C++ and is available at the following Web site http://gatb.inria.fr under the A-GPL license.

Contact: lavenier@irisa.fr

Supplementary information: Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[COSMOS: Python library for massively parallel workflows]]> https://www.researchpad.co/product?articleinfo=5ba773e840307c2b1f5d4815

Summary: Efficient workflows to shepherd clinically generated genomic data through the multiple stages of a next-generation sequencing pipeline are of critical importance in translational biomedical science. Here we present COSMOS, a Python library for workflow management that allows formal description of pipelines and partitioning of jobs. In addition, it includes a user interface for tracking the progress of jobs, abstraction of the queuing system and fine-grained control over the workflow. Workflows can be created on traditional computing clusters as well as cloud-based services.

Availability and implementation: Source code is available for academic non-commercial research purposes. Links to code and documentation are provided at http://lpm.hms.harvard.edu and http://wall-lab.stanford.edu.

Contact: dpwall@stanford.edu or peter_tonellato@hms.harvard.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[bammds: a tool for assessing the ancestry of low-depth whole-genome data using multidimensional scaling (MDS)]]> https://www.researchpad.co/product?articleinfo=5ba773ec40307c2b1f5d4817

Summary: We present bammds, a practical tool that allows visualization of samples sequenced by second-generation sequencing when compared with a reference panel of individuals (usually genotypes) using a multidimensional scaling algorithm. Our tool is aimed at determining the ancestry of unknown samples—typical of ancient DNA data—particularly when only low amounts of data are available for those samples.

Availability and implementation: The software package is freely available under the GNU General Public License v3, together with test datasets, at https://savannah.nongnu.org/projects/bammds/. It uses R (http://www.r-project.org/), parallel (http://www.gnu.org/software/parallel/) and samtools (https://github.com/samtools/samtools).

Contact: bammds-users@nongnu.org

Supplementary information: Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[RAPIDR: an analysis package for non-invasive prenatal testing of aneuploidy]]> https://www.researchpad.co/product?articleinfo=5ba773ef40307c2b1f5d4818

Non-invasive prenatal testing (NIPT) of fetal aneuploidy using cell-free fetal DNA is becoming part of routine clinical practice. RAPIDR (Reliable Accurate Prenatal non-Invasive Diagnosis R package) is an easy-to-use open-source R package that implements several published NIPT analysis methods. The input to RAPIDR is a set of sequence alignment files in the BAM format, and the outputs are calls for aneuploidy, including trisomies 13, 18, 21 and monosomy X as well as fetal sex. RAPIDR has been extensively tested with a large sample set as part of the RAPID project in the UK. The package contains quality control steps to make it robust for use in the clinical setting.

Availability and implementation: RAPIDR is implemented in R and can be freely downloaded via CRAN from here: http://cran.r-project.org/web/packages/RAPIDR/index.html.

Contact: kitty.lo@ucl.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[Abrupt community transitions and cyclic evolutionary dynamics in complex food webs]]> https://www.researchpad.co/product?articleinfo=5ba0202140307c4e2b77ae2d

Understanding the emergence and maintenance of biodiversity ranks among the most fundamental challenges in evolutionary ecology. While processes of community assembly have frequently been analyzed from an ecological perspective, their evolutionary dimensions have so far received less attention. To elucidate the eco-evolutionary processes underlying the long-term build-up and potential collapse of community diversity, here we develop and examine an individual-based model describing coevolutionary dynamics driven by trophic interactions and interference competition, of a pair of quantitative traits determining predator and prey niches. Our results demonstrate the (1) emergence of communities with multiple trophic levels, shown here for the first time for stochastic models with linear functional responses, and (2) intermittent and cyclic evolutionary transitions between two alternative community states. In particular, our results indicate that the interplay of ecological and evolutionary dynamics often results in extinction cascades that remove the entire trophic level of consumers from a community. Finally, we show the (3) robustness of our results under variations of model assumptions, underscoring that processes of consumer collapse and subsequent rebound could be important elements of understanding biodiversity dynamics in natural communities.

]]>
<![CDATA[Variable Selection in Kernel Regression Using Measurement Error Selection Likelihoods]]> https://www.researchpad.co/product?articleinfo=5b5c02a3463d7e28a3e55d69

]]>
<![CDATA[Linear spline multilevel models for summarising childhood growth trajectories: A guide to their application using examples from five birth cohorts]]> https://www.researchpad.co/product?articleinfo=5b043693463d7e0e880ab6e1

Childhood growth is of interest in medical research concerned with determinants and consequences of variation from healthy growth and development. Linear spline multilevel modelling is a useful approach for deriving individual summary measures of growth, which overcomes several data issues (co-linearity of repeat measures, the requirement for all individuals to be measured at the same ages and bias due to missing data). Here, we outline the application of this methodology to model individual trajectories of length/height and weight, drawing on examples from five cohorts from different generations and different geographical regions with varying levels of economic development. We describe the unique features of the data within each cohort that have implications for the application of linear spline multilevel models, for example, differences in the density and inter-individual variation in measurement occasions, and multiple sources of measurement with varying measurement error. After providing example Stata syntax and a suggested workflow for the implementation of linear spline multilevel models, we conclude with a discussion of the advantages and disadvantages of the linear spline approach compared with other growth modelling methods such as fractional polynomials, more complex spline functions and other non-linear models.
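The fixed-effects part of a linear spline growth model can be sketched as below. This is an illustration only, written in Python rather than the Stata syntax the article provides: the knot ages, intercept, and slopes are hypothetical values, ages are assumed to start at zero (birth), and the multilevel (random-effects) structure is not shown.

```python
def linear_spline_basis(age, knots):
    """Split `age` into the time spent in each period between consecutive knots.

    With knots [3, 12], an age of 5 gives [3, 2, 0]: three units in the first
    period, two in the second, none in the third.
    """
    edges = [0.0] + sorted(float(k) for k in knots)
    basis = []
    for i, lower in enumerate(edges):
        upper = edges[i + 1] if i + 1 < len(edges) else float("inf")
        basis.append(max(0.0, min(age, upper) - lower))
    return basis


def predict_growth(age, intercept, slopes, knots):
    """Piecewise-linear prediction: intercept plus one slope per period."""
    basis = linear_spline_basis(age, knots)
    return intercept + sum(s * b for s, b in zip(slopes, basis))


if __name__ == "__main__":
    knots = [3, 12]            # hypothetical knot ages in months
    slopes = [3.5, 1.5, 0.9]   # hypothetical growth rates (cm per month) per period
    print(linear_spline_basis(5, knots))
    print(predict_growth(5, 50.0, slopes, knots))
```

In a multilevel version, each child would have their own deviations from the average intercept and slopes, and those individual values serve as the summary measures of growth.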

]]>
<![CDATA[The spatiotemporal order of signaling events unveils the logic of development signaling]]> https://www.researchpad.co/product?articleinfo=5b00dd3a463d7e3c2d2a5071

Motivation: Animals from worms and insects to birds and mammals show distinct body plans; however, the embryonic development of diverse body plans, with the tissues and organs within them, is controlled by surprisingly few signaling pathways. It is well recognized that the combinatorial use of, and dynamic interactions among, signaling pathways follow specific logic to control complex and accurate developmental signaling and patterning, but it remains elusive what such logic is, or even what it looks like.

Results: We have developed a computational model for Drosophila eye development with innovative methods to reveal how interactions among multiple pathways control the dynamically generated hexagonal array of R8 cells. We obtained two novel findings. First, the coupling between the long-range inductive signals produced by the proneural Hh signaling and the short-range restrictive signals produced by the antineural Notch and EGFR signaling is essential for generating accurately spaced R8s. Second, the spatiotemporal orders of key signaling events reveal a robust pattern of lateral inhibition conducted by Ato-coordinated Notch and EGFR signaling to collectively determine R8 patterning. This pattern, stipulating the orders of signaling and comparable to the protocols of communication, may help decipher the well-appreciated but poorly defined logic of developmental signaling.

Availability and implementation: The model is available upon request.

Contact: hao.zhu@ymail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[Maximum type 1 error rate inflation in multiarmed clinical trials with adaptive interim sample size modifications]]> https://www.researchpad.co/product?articleinfo=5add677f463d7e355c484536

Sample size modifications in the interim analyses of an adaptive design can inflate the type 1 error rate if test statistics and critical boundaries are used in the final analysis as if no modification had been made. While this is already true for designs with an overall change of the sample size in a balanced treatment-control comparison, the inflation can be much larger if a modification of allocation ratios is also allowed. In this paper, we investigate adaptive designs with several treatment arms compared to a single common control group. Regarding modifications, we consider treatment arm selection as well as modifications of overall sample size and allocation ratios. The inflation is quantified for two approaches: a naive procedure that ignores not only all modifications, but also the multiplicity issue arising from the many-to-one comparison, and a Dunnett procedure that ignores modifications, but adjusts for the initially started multiple treatments. The maximum inflation of the type 1 error rate for such types of design can be calculated by searching for the “worst case” scenarios, that is, sample size adaptation rules in the interim analysis that lead to the largest conditional type 1 error rate at any point of the sample space. To show the most extreme inflation, we initially assume unconstrained second stage sample size modifications leading to a large inflation of the type 1 error rate. Furthermore, we investigate the inflation when putting constraints on the second stage sample sizes. It turns out that, for example, fixing the sample size of the control group leads to designs that control the type 1 error rate.
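One ingredient of the naive procedure, the multiplicity issue in a many-to-one comparison, can be illustrated with a small simulation. This sketch covers only that ingredient in a fixed (non-adaptive) design; it does not reproduce the paper's worst-case search over sample size adaptation rules. It assumes normally distributed outcomes with known unit variance, equal group sizes, and one-sided z-tests; all names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist


def naive_familywise_error(n_arms, n_per_group, alpha=0.025, n_trials=20000, seed=1):
    """Estimate P(at least one false rejection) when each treatment arm is
    tested against a shared control at level `alpha` with no adjustment."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    se_mean = (1.0 / n_per_group) ** 0.5  # standard error of one group mean
    se_diff = (2.0 / n_per_group) ** 0.5  # standard error of a difference in means
    rejections = 0
    for _ in range(n_trials):
        # Under the global null hypothesis all group means are zero.
        control = rng.gauss(0.0, se_mean)
        z_max = max(
            (rng.gauss(0.0, se_mean) - control) / se_diff for _ in range(n_arms)
        )
        if z_max > z_crit:
            rejections += 1
    return rejections / n_trials


if __name__ == "__main__":
    print(naive_familywise_error(n_arms=1, n_per_group=20))
    print(naive_familywise_error(n_arms=3, n_per_group=20))
```

With a single arm the estimate should sit near the nominal level, while with several arms it rises above it but stays below the Bonferroni bound of `n_arms * alpha`. A Dunnett-type critical value would remove this part of the inflation, leaving only the part caused by the interim modifications.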

]]>