ResearchPad - applications-notes https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks
<![CDATA[RiboFlow, RiboR and RiboPy: an ecosystem for analyzing ribosome profiling data at read length resolution]]> https://www.researchpad.co/article/N1e190294-f8f0-4e62-8bfe-9beb97a99364

Ribosome occupancy measurements enable protein abundance estimation and infer mechanisms of translation. Recent studies have revealed that sequence read lengths in ribosome profiling data are highly variable and carry critical information. Consequently, data analyses require the computation and storage of multiple metrics for a wide range of ribosome footprint lengths. We developed a software ecosystem including a new efficient binary file format named ‘ribo’. Ribo files store all essential data grouped by ribosome footprint lengths. Users can assemble ribo files using our RiboFlow pipeline that processes raw ribosomal profiling sequencing data. RiboFlow is highly portable and customizable across a large number of computational environments with built-in capabilities for parallelization. We also developed interfaces for writing and reading ribo files in the R (RiboR) and Python (RiboPy) environments. Using RiboR and RiboPy, users can efficiently access ribosome profiling quality control metrics, generate essential plots and carry out analyses. Altogether, these components create a software ecosystem for researchers to study translation through ribosome profiling.

Availability and implementation

For a quickstart, please see https://ribosomeprofiling.github.io. Source code, installation instructions and links to documentation are available on GitHub: https://github.com/ribosomeprofiling.

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[3D-Cell-Annotator: an open-source active surface tool for single-cell segmentation in 3D microscopy images]]> https://www.researchpad.co/article/N3373eae5-e76e-493f-9197-c6f496095c01

Segmentation of single cells in microscopy images is one of the major challenges in computational biology. It is the first step of most bioimage analysis tasks, and essential to create training sets for more advanced deep learning approaches. Here, we propose 3D-Cell-Annotator to solve this task using 3D active surfaces together with shape descriptors as prior information in a semi-automated fashion. The software uses the convenient 3D interface of the widely used Medical Imaging Interaction Toolkit (MITK). Results on 3D biological structures (e.g. spheroids, organoids and embryos) show that the precision of the segmentation reaches the level of a human expert.

Availability and implementation

3D-Cell-Annotator is implemented in CUDA/C++ as a patch for the segmentation module of MITK. The 3D-Cell-Annotator-enabled MITK distribution can be downloaded at: www.3D-cell-annotator.org. It works under Windows 64-bit systems and recent Linux distributions even on a consumer level laptop with a CUDA-enabled video card using recent NVIDIA drivers.

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[The open targets post-GWAS analysis pipeline]]> https://www.researchpad.co/article/Na8d251ed-6620-4a18-bb78-564e7e8d3f79

Genome-wide association studies (GWAS) are a powerful method to detect even weak associations between variants and phenotypes; however, many of the identified associated variants are in non-coding regions, and presumably influence gene expression regulation. Identifying potential drug targets, i.e. causal protein-coding genes, therefore, requires crossing the genetics results with functional data.

Results

We present a novel data integration pipeline that analyses GWAS results in the light of experimental epigenetic and cis-regulatory datasets, such as ChIP-Seq, Promoter-Capture Hi-C or eQTL, and presents them in a single report, which can be used for inferring likely causal genes. This pipeline was then fed into an interactive data resource.

Availability and implementation

The analysis code is available at www.github.com/Ensembl/postgap and the interactive data browser at postgwas.opentargets.io.

]]>
<![CDATA[PISA-SPARKY: an interactive SPARKY plugin to analyze oriented solid-state NMR spectra of helical membrane proteins]]> https://www.researchpad.co/article/N981a32bd-a37b-4315-9117-3eabfe7b2b1c

Two-dimensional [15N-1H] separated local field solid-state nuclear magnetic resonance (NMR) experiments of membrane proteins aligned in lipid bilayers provide tilt and rotation angles for α-helical segments using Polar Index Slant Angle (PISA)-wheel models. No integrated software has been made available for data analysis and visualization.

Results

We have developed the PISA-SPARKY plugin to seamlessly integrate PISA-wheel modeling into the NMRFAM-SPARKY platform. The plugin performs basic simulations, exhaustive fitting against experimental spectra, error analysis and dipolar and chemical shift wave plotting. The plugin also supports PyMOL integration and handling of parameters that describe variable alignment and dynamic scaling encountered with magnetically aligned media, ensuring optimal fitting and generation of restraints for structure calculation.

Availability and implementation

PISA-SPARKY is freely available in the latest version of NMRFAM-SPARKY from the National Magnetic Resonance Facility at Madison (http://pine.nmrfam.wisc.edu/download_packages.html), the NMRbox Project (https://nmrbox.org) and to subscribers of the SBGrid (https://sbgrid.org). The pisa.py script is available and documented on GitHub (https://github.com/weberdak/pisa.py) along with a tutorial video and sample data.

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[Identifying and removing haplotypic duplication in primary genome assemblies]]> https://www.researchpad.co/article/Ne6d65ccc-49b2-4db7-a8a2-89a52f6f955b

Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors.

Results

Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines.

Availability and implementation

The source code is written in C and is available at https://github.com/dfguan/purge_dups.

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[CytoSeg 2.0: automated extraction of actin filaments]]> https://www.researchpad.co/article/N6afbaa1f-2cf1-43b9-ad3f-012a20bb56e4

Actin filaments (AFs) are dynamic structures that substantially change their organization over time. The dynamic behavior and the relatively low signal-to-noise ratio during live-cell imaging have rendered the quantification of the actin organization a difficult task.

Results

We developed an automated image-based framework that extracts AFs from fluorescence microscopy images and represents them as networks, which are automatically analyzed to identify and compare biologically relevant features. Although the source code is freely available, we have now implemented the framework into a graphical user interface that can be installed as a Fiji plugin, thus enabling easy access by the research community.

Availability and implementation

CytoSeg 2.0 is open-source software under the GPL and is available on GitHub: https://github.com/jnowak90/CytoSeg2.0.

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[GSOAP: a tool for visualization of gene set over-representation analysis]]> https://www.researchpad.co/article/N16a33cc8-828a-4a96-aeb8-a489b2870d93

Gene set over-representation analysis (GSOA) is a common technique of enrichment analysis that measures the overlap between a gene set and selected instances (e.g. pathways). Despite its popularity, there is currently no established standard for visualization of GSOA results.

Results

Here, we propose a visual exploration of the GSOA results by showing the relationships among the enriched instances, while highlighting important instance attributes, such as significance, closeness (centrality) and clustering.

Availability and implementation

GSOAP is implemented as an R package and is available at https://github.com/tomastokar/gsoap.

]]>
<![CDATA[Defining data-driven primary transcript annotations with <i>primaryTranscriptAnnotation</i> in R]]> https://www.researchpad.co/article/N46ca6388-4785-4186-a69b-e5d52a91f458

Nascent transcript measurements derived from run-on sequencing experiments are critical for the investigation of transcriptional mechanisms and regulatory networks. However, conventional mRNA gene annotations significantly differ from the boundaries of primary transcripts. New primary transcript annotations are needed to accurately interpret run-on data. We developed the primaryTranscriptAnnotation R package to infer the transcriptional start and termination sites of primary transcripts from genomic run-on data. We then used these inferred coordinates to annotate transcriptional units identified de novo. This package provides the novel utility to integrate data-driven primary transcript annotations with transcriptional unit coordinates identified in an unbiased manner. Highlighting the importance of using accurate primary transcript coordinates, we demonstrate that this new methodology increases the detection of differentially expressed transcripts and provides more accurate quantification of RNA polymerase pause indices.

Availability and implementation

https://github.com/WarrenDavidAnderson/genomicsRpackage/tree/master/primaryTranscriptAnnotation.

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[ChemBioServer 2.0: an advanced web server for filtering, clustering and networking of chemical compounds facilitating both drug discovery and repurposing]]> https://www.researchpad.co/article/N9a5ff660-160e-4563-8b2c-4c93304d161f

Abstract

Summary

ChemBioServer 2.0 is the advanced sequel of a web server for filtering, clustering and networking of chemical compound libraries facilitating both drug discovery and repurposing. It provides researchers the ability to (i) browse and visualize compounds along with their physicochemical and toxicity properties, (ii) perform property-based filtering of compounds, (iii) explore compound libraries for lead optimization based on perfect match substructure search, (iv) re-rank virtual screening results to achieve selectivity for a protein of interest against different protein members of the same family, selecting only those compounds that score high for the protein of interest, (v) perform clustering among the compounds based on their physicochemical properties providing representative compounds for each cluster, (vi) construct and visualize a structural similarity network of compounds providing a set of network analysis metrics, (vii) combine a given set of compounds with a reference set of compounds into a single structural similarity network providing the opportunity to infer drug repurposing due to transitivity, (viii) remove compounds from a network based on their similarity with unwanted substances (e.g. failed drugs) and (ix) build custom compound mining pipelines.
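The selectivity re-ranking in item (iv) can be illustrated with a small sketch. This is not ChemBioServer's code: the function name `rerank_for_selectivity`, the `min_margin` parameter and the nested-dictionary input are hypothetical, and the sketch assumes that a higher score means better predicted binding.

```python
def rerank_for_selectivity(scores, target, min_margin=0.0):
    """Rank compounds by how much better they score on `target` than elsewhere.

    scores: {compound: {protein: score}}; higher score = better (assumption).
    Returns (compound, margin) pairs, most selective first.
    """
    ranked = []
    for compound, by_protein in scores.items():
        # Best score against any other family member (the off-targets).
        off_target = [s for p, s in by_protein.items() if p != target]
        best_off = max(off_target) if off_target else float("-inf")
        margin = by_protein[target] - best_off
        # Keep only compounds that favour the protein of interest.
        if margin > min_margin:
            ranked.append((compound, margin))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

Ranking by the margin over the best off-target score, rather than by the raw target score, is one simple way to prefer compounds that are specific to the protein of interest.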

Availability and implementation

http://chembioserver.vi-seem.eu.

]]>
<![CDATA[MemBlob database and server for identifying transmembrane regions using cryo-EM maps]]> https://www.researchpad.co/article/N98e3d69d-d9a9-4eff-b09e-c404504024c0

Abstract

Summary

The identification of transmembrane helices in transmembrane proteins is crucial, not only to understand their mechanism of action but also to develop new therapies. While experimental data on the boundaries of membrane-embedded regions are sparse, this information is present in cryo-electron microscopy (cryo-EM) density maps and has not yet been utilized for determining membrane regions. We developed a computational pipeline that takes as inputs a cryo-EM map, the corresponding atomistic structure and the potential bilayer orientation of a given protein as determined by the TMDET algorithm, and outputs the residues assigned to the bulk water phase, the lipid interface and the lipid hydrophobic core. Based on this method, we built a database of published cryo-EM protein structures and a server that computes these data for newly obtained structures.

Availability and implementation

http://memblob.hegelab.org.

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[AQUA-DUCT 1.0: structural and functional analysis of macromolecules from an intramolecular voids perspective]]> https://www.researchpad.co/article/N82a69689-893c-458c-9812-8a075bdd4bc6

Abstract

Motivation

Tunnels, pores, channels, pockets and cavities contribute to protein architecture and performance. However, transportation pathways and internal binding cavities are usually analysed and characterized separately. We aimed to provide a universal tool for analysing the entire protein interior, with access to detailed information on ligand transport phenomena and binding preferences.

Results

AQUA-DUCT version 1.0 is a comprehensive method for analysing macromolecules from the intramolecular voids perspective using small ligands as molecular probes. This version gives insight into several properties of macromolecules and facilitates protein engineering and drug design by combining the tracking and local mapping approaches to small ligands.

Availability and implementation

http://www.aquaduct.pl.

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics]]> https://www.researchpad.co/article/N78740469-04a8-4a39-867e-5d828066351f

Abstract

Summary

We present an R package called iq to enable accurate protein quantification for label-free data-independent acquisition (DIA) mass spectrometry-based proteomics, a recently developed global approach with superior quantitative consistency. We implement the popular maximal peptide ratio extraction module of the MaxLFQ algorithm, so far only applicable to data-dependent acquisition mode using the software suite MaxQuant. Moreover, our implementation shows, for each protein separately, the validity of quantification over all samples. Hence, iq exports a state-of-the-art protein quantification algorithm to the emerging DIA mode in an open-source implementation.

Availability and implementation

The open-source R package is available on CRAN, https://github.com/tvpham/iq/releases and oncoproteomics.nl/iq.

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[DeepSimulator1.5: a more powerful, quicker and lighter simulator for Nanopore sequencing]]> https://www.researchpad.co/article/N6db37b2d-5ec1-4e21-86d3-e38530a5d172

Abstract

Motivation

Nanopore sequencing is one of the leading third-generation sequencing technologies. A number of computational tools have been developed to facilitate the processing and analysis of the Nanopore data. Previously, we have developed DeepSimulator1.0 (DS1.0), which is the first simulator for Nanopore sequencing to produce both the raw electrical signals and the reads. However, although DS1.0 can produce high-quality reads, for some sequences, the divergence between the simulated raw signals and the real signals can be large. Furthermore, the Nanopore sequencing technology has evolved greatly since DS1.0 was released. It is thus necessary to update DS1.0 to accommodate those changes.

Results

We propose DeepSimulator1.5 (DS1.5), all three modules of which have been updated substantially from DS1.0. For the sequence generator, we updated the sample read length distribution to reflect the features of the newest real reads. For the signal generator, which is the core of DeepSimulator, we added one more pore model, the context-independent pore model, which is much faster than the previous context-dependent one. Furthermore, to make the generated signals more similar to the real ones, we added a low-pass filter to post-process the pore model signals. For the basecaller, we added support for the newest official basecaller, Guppy, which can run on both GPU and CPU. In addition, multiple optimizations related to multiprocessing control, memory and storage management have been implemented to make DS1.5 a much more amenable and lighter simulator than DS1.0.
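To illustrate what low-pass post-processing of a simulated signal does, here is a minimal sketch of a generic first-order (exponential smoothing) filter. The abstract does not say which filter design DS1.5 uses, so this is an assumed stand-in, and the function name `low_pass` and the `alpha` parameter are hypothetical.

```python
def low_pass(signal, alpha=0.3):
    """First-order (exponential smoothing) low-pass filter.

    alpha in (0, 1]: smaller values smooth more strongly;
    alpha = 1 returns the signal unchanged.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    filtered = []
    previous = None
    for x in signal:
        # Move the running value a fraction alpha toward the new sample.
        previous = x if previous is None else previous + alpha * (x - previous)
        filtered.append(previous)
    return filtered
```

Applied to an abrupt step or spike, the filter spreads the change over several samples, which damps high-frequency content while keeping the overall level of the signal.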

Availability and implementation

The main program and the data are available at https://github.com/lykaust15/DeepSimulator.

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[HaploTypo: a variant-calling pipeline for phased genomes]]> https://www.researchpad.co/article/Ne9994da7-7eb8-4650-a8d3-eeb573b56dbe

Abstract

Summary

An increasing number of phased (i.e. with resolved haplotypes) reference genomes are available. However, most genetic variant calling tools do not explicitly account for haplotype structure. Here, we present HaploTypo, a pipeline tailored to resolve haplotypes in genetic variation analyses. HaploTypo infers the haplotype correspondence for each heterozygous variant called on a phased reference genome.

Availability and implementation

HaploTypo is implemented in Python 2.7 and Python 3.5, and is freely available at https://github.com/gabaldonlab/haplotypo, and as a Docker image.

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[Pepitope: epitope mapping from affinity-selected peptides]]> https://www.researchpad.co/article/N4fc36ce9-2500-4c0b-83ff-a07ad5ec1216

Abstract

Identifying the epitope to which an antibody binds is central for many immunological applications such as drug design and vaccine development. The Pepitope server is a web-based tool that aims at predicting discontinuous epitopes based on a set of peptides that were affinity-selected against a monoclonal antibody of interest. The server implements three different algorithms for epitope mapping: PepSurf, Mapitope, and a combination of the two. The rationale behind these algorithms is that the set of peptides mimics the genuine epitope in terms of physicochemical properties and spatial organization. When the three-dimensional (3D) structure of the antigen is known, the information in these peptides can be used to computationally infer the corresponding epitope. A user-friendly web interface and a graphical tool that allows viewing the predicted epitopes were developed. Pepitope can also be applied for inferring other types of protein–protein interactions beyond the immunological context, and as a general tool for aligning linear sequences to a 3D structure.

Availability: http://pepitope.tau.ac.il/

Contact: talp@post.tau.ac.il

]]>
<![CDATA[Assembling millions of short DNA sequences using SSAKE]]> https://www.researchpad.co/article/Nb4ae0836-d49b-4488-bf23-eb7fd3bacbf0

Abstract

Summary: Novel DNA sequencing technologies with the potential for up to three orders of magnitude more sequence throughput than conventional Sanger sequencing are emerging. The instrument now available from Solexa Ltd produces millions of short DNA sequences of 25 nt each. Due to ubiquitous repeats in large genomes and the inability of short sequences to uniquely and unambiguously characterize them, the short read length limits applicability for de novo sequencing. However, given the sequencing depth and the throughput of this instrument, stringent assembly of highly identical sequences can be achieved. We describe SSAKE, a tool for aggressively assembling millions of short nucleotide sequences by progressively searching through a prefix tree for the longest possible overlap between any two sequences. SSAKE is designed to help leverage the information from short sequence reads by stringently assembling them into contiguous sequences that can be used to characterize novel sequencing targets.
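The idea of extending a sequence by the longest available overlap can be sketched as follows. This is a simplified illustration, not SSAKE's implementation: it indexes read prefixes in a plain dictionary instead of a prefix tree, extends in one direction only, and ignores reverse complements. The names `build_prefix_index`, `extend_contig` and `min_overlap` are hypothetical.

```python
from collections import defaultdict


def build_prefix_index(reads, min_overlap):
    """Map each proper prefix of length >= min_overlap to the reads starting with it."""
    index = defaultdict(set)
    for read in reads:
        for k in range(min_overlap, len(read)):
            index[read[:k]].add(read)
    return index


def extend_contig(seed, reads, min_overlap=4):
    """Greedily extend `seed` to the right, trying the longest overlap first."""
    unused = set(reads) - {seed}
    index = build_prefix_index(unused, min_overlap)
    contig = seed
    extended = True
    while extended:
        extended = False
        # Compare ever shorter suffixes of the contig with read prefixes.
        for k in range(len(contig), min_overlap - 1, -1):
            suffix = contig[-k:]
            candidates = [r for r in index.get(suffix, ()) if r in unused]
            if len(candidates) == 1:
                # Stringent rule: extend only when the choice is unambiguous.
                read = candidates[0]
                contig += read[k:]
                unused.discard(read)
                extended = True
                break
            if len(candidates) > 1:
                break  # ambiguous extension: stop rather than guess
    return contig
```

Stopping at the first ambiguous overlap is one way to keep the assembly stringent: the contig ends where two different reads could continue it.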

Availability:

Contact: rwarren@bcgsc.ca

]]>
<![CDATA[Colorstock, SScolor, Ratón: RNA alignment visualization tools]]> https://www.researchpad.co/article/N75e264de-d886-4c2f-b7a6-61e8d23be820

Abstract

Summary: Interactive examination of RNA multiple alignments for covariant mutations is a useful step in non-coding RNA sequence analysis. We present three parallel implementations of an RNA visualization metaphor: Colorstock, a command-line script using ANSI terminal color; SScolor, a Perl script that generates static HTML pages; and Ratón, an AJAX web application generating dynamic HTML. Each tool can be used to color RNA alignments by secondary structure and to visually highlight compensatory mutations in stems.

Availability: All source code is freely available under the GPL. The source code can be downloaded and a prototype of Ratón can be accessed at http://biowiki.org/RnaAlignmentViewers

Contact: ihh@berkeley.edu

]]>
<![CDATA[RepeatCraft: a meta-pipeline for repetitive element de-fragmentation and annotation]]> https://www.researchpad.co/article/5c9e595cd5eed0c484242cf6

Abstract

Summary

Repetitive elements comprise a large proportion of many genomes. They have an impact on both genome evolution and regulation. Their classification and the study of their evolutionary history is a major emerging field. Various software tools exist to date to classify and map repeats across genomes. The major unresolved drawback, however, is the fragmented nature of many identified repeat loci. This ultimately makes the classification of novel repeats and their evolutionary analyses difficult. To improve on this, we developed a pipeline (RepeatCraft) that integrates results from several repeat element classification tools based on both sequence similarity and structural features. The pipeline de-fragments closely spaced repeat loci in the genomes, reconstructing longer copies and thus allowing for better annotation and sequence comparisons. The pipeline also includes a user interface that can run in a web browser, allowing easy access to and exploration of the repeat data.
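De-fragmenting closely spaced repeat loci can be pictured as an interval merge. The sketch below is an assumed simplification, not RepeatCraft's algorithm: it merges loci of the same family on the same chromosome when the gap between them is at most `max_gap` base pairs, and it ignores strand and structural features. The `RepeatLocus` type, the `defragment` function and the default gap of 150 bp are hypothetical.

```python
from typing import List, NamedTuple


class RepeatLocus(NamedTuple):
    chrom: str
    start: int
    end: int
    family: str


def defragment(loci: List[RepeatLocus], max_gap: int = 150) -> List[RepeatLocus]:
    """Merge same-family loci on one chromosome separated by at most max_gap bp."""
    merged: List[RepeatLocus] = []
    # Sorting by family groups fragments even if another repeat lies between them.
    for locus in sorted(loci, key=lambda l: (l.chrom, l.family, l.start)):
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.chrom == locus.chrom
            and last.family == locus.family
            and locus.start - last.end <= max_gap
        ):
            # Extend the previous locus to cover this fragment.
            merged[-1] = last._replace(end=max(last.end, locus.end))
        else:
            merged.append(locus)
    # Report the result in genomic order.
    return sorted(merged, key=lambda l: (l.chrom, l.start))
```

Choosing `max_gap` is a trade-off: a larger value reconstructs longer copies but risks joining independent insertions of the same family.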

Availability and implementation

RepeatCraft is implemented in Python and the web application is implemented in R. Downloads and documentation are freely available at https://github.com/niccw/repeatCraftp.

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[FastSpar: rapid and scalable correlation estimation for compositional data]]> https://www.researchpad.co/article/5c9e593ed5eed0c484242985

Abstract

Summary

A common goal of microbiome studies is the elucidation of community composition and member interactions using counts of taxonomic units extracted from sequence data. Inference of interaction networks from sparse and compositional data requires specialized statistical approaches. A popular solution is SparCC; however, its performance limits the calculation of interaction networks for very high-dimensional datasets. Here we introduce FastSpar, an efficient and parallelizable implementation of the SparCC algorithm which rapidly infers correlation networks and calculates P-values using an unbiased estimator. We further demonstrate that FastSpar reduces network inference wall time by 2–3 orders of magnitude compared to SparCC.
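Compositional methods of this kind build on the variance of log-ratios between pairs of components, which is unaffected by the arbitrary total count of each sample. The sketch below computes only that quantity and is not the SparCC or FastSpar algorithm; the function name `log_ratio_variances` and the fixed `pseudocount` used to avoid taking the logarithm of zero are assumptions made for illustration.

```python
import math
from statistics import variance


def log_ratio_variances(counts, pseudocount=1.0):
    """Return t[i][j] = Var(log(x_i / x_j)) across samples.

    counts: list of samples, each a list of counts per taxon (>= 2 samples).
    """
    n_taxa = len(counts[0])
    # Log relative abundances per sample; the pseudocount avoids log(0).
    logs = []
    for sample in counts:
        total = sum(c + pseudocount for c in sample)
        logs.append([math.log((c + pseudocount) / total) for c in sample])
    t = [[0.0] * n_taxa for _ in range(n_taxa)]
    for i in range(n_taxa):
        for j in range(i + 1, n_taxa):
            # The sample total cancels in the difference of logs.
            ratios = [row[i] - row[j] for row in logs]
            t[i][j] = t[j][i] = variance(ratios)
    return t
```

A value near zero for a pair of taxa means their abundances stay proportional across samples, whereas a large value means they vary independently or in opposite directions.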

Availability and implementation

FastSpar source code, precompiled binaries and platform packages are freely available on GitHub: github.com/scwatts/FastSpar

Supplementary information

Supplementary data are available at Bioinformatics online.

]]>
<![CDATA[WIlsON: Web-based Interactive Omics VisualizatioN]]> https://www.researchpad.co/article/5c9e5944d5eed0c484242a6e

Abstract

Motivation

High throughput (HT) screens in the omics field are typically analyzed by automated pipelines that generate static visualizations and comprehensive spreadsheet data for scientists. However, exploratory and hypothesis-driven data analyses are key aspects of understanding biological systems, and both generate an extensive need for customized and dynamic visualization.

Results

Here we describe WIlsON, an interactive workbench for analysis and visualization of multi-omics data. It is primarily intended to empower screening platforms to offer access to pre-calculated HT screen results to the non-computational scientist. Facilitated by an open file format, WIlsON supports all types of omics screens, serves results via a web-based dashboard, and enables end users to perform analyses and generate publication-ready plots.

Availability and implementation

We implemented WIlsON in R with a focus on extensibility using the modular Shiny and Plotly frameworks. A demo of the interactive workbench without limitations may be accessed at http://loosolab.mpi-bn.mpg.de. A standalone Docker container as well as the source code of WIlsON are freely available from our Docker hub https://hub.docker.com/r/loosolab/wilson, CRAN https://cran.r-project.org/web/packages/wilson/, and GitHub repository https://github.molgen.mpg.de/loosolab/wilson-apps, respectively.

]]>