ResearchPad - Information Systems and Management https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks <![CDATA[PERSISTENT MISSION HOME DELIVERY IN IBADAN: ATTRACTIVE ROLE OF TRADITIONAL BIRTH ATTENDANTS]]> https://www.researchpad.co/product?articleinfo=5ad7d59c463d7e12f460f257

Background and objective:

One of the major factors responsible for high maternal and neonatal mortality in Nigeria and other developing countries is the use of Traditional Birth Attendants (TBAs). The current study was carried out to evaluate the attractive roles of TBAs that make pregnant women use them persistently.

Methodology:

The study was conducted in the Ido and Lagelu local government areas of Oyo State, Nigeria. Basic demographic data on the TBAs were collected, and the TBAs were then followed up for a period of six months by trained nurses and doctors, targeting a total of ten direct observations per TBA per ANC/delivery.

Results:

There were a total of 146 TBAs, of whom 134 fulfilled the inclusion criteria and were recruited into the study. The male-to-female ratio was 1:133 and the age range was 22–68 years, with 70.1% above 40 years. Seventy-two per cent of them had only elementary education, and 72%, 30% and 38% had been re-trained by the LGA, SMOH and national TBA associations, respectively. Post-partum care, counseling services, tender care in labour, easy accessibility, accommodation of other relations and payment by instalments were observed in all TBAs, while 60–98% of them made home visits, assisted in referrals and arranged for USS and laboratory services.

Conclusions and Recommendations:

These good practices should be incorporated into the formal health sector, and attitudinal change among current health workers across all levels of health care should be encouraged. CHEWs should also be primarily involved in home visits during pregnancy and in post-natal care services.

]]>
<![CDATA[CrypticIBDcheck: an R package for checking cryptic relatedness in nominally unrelated individuals]]> https://www.researchpad.co/product?articleinfo=5989da59ab0ee8fa60b8f7b2

Background

In population association studies, standard methods of statistical inference assume that study subjects are independent samples. In genetic association studies, it is therefore of interest to diagnose undocumented close relationships in nominally unrelated study samples.

Results

We describe the R package CrypticIBDcheck to identify pairs of closely related subjects based on genetic marker data from single-nucleotide polymorphisms (SNPs). The package is able to accommodate SNPs in linkage disequilibrium (LD), without the need to thin the markers so that they are approximately independent in the population. Sample pairs are identified by superposing their estimated identity-by-descent (IBD) coefficients on plots of IBD coefficients for pairs of simulated subjects from one of several common close relationships.
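As a rough illustration of the underlying idea (a crude identity-by-state summary sketched in Python, not the IBD-coefficient estimator that the R package CrypticIBDcheck actually implements):

```python
# Conceptual sketch only: summarize allele sharing for a pair of
# subjects from genotypes coded 0/1/2 (minor-allele counts per SNP).
# CrypticIBDcheck itself estimates IBD coefficients and handles LD;
# this toy statistic does neither.

def mean_ibs(g1, g2):
    """Average identity-by-state per locus: 1.0 for identical
    genotypes, 0.0 for opposite homozygotes."""
    assert len(g1) == len(g2) and g1
    return sum(1 - abs(a - b) / 2 for a, b in zip(g1, g2)) / len(g1)

# Duplicates or identical twins score near 1.0; unrelated pairs
# drift toward the population background level.
twin_like = mean_ibs([0, 1, 2, 1, 0], [0, 1, 2, 1, 0])   # 1.0
```

Real relatedness checking compares such pairwise statistics against the distributions expected under parent-offspring, full-sibling and other relationships, which is what the simulated-subject plots described above provide.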

Conclusions

The methods implemented in CrypticIBDcheck are particularly relevant to candidate-gene association studies, in which dependent SNPs cluster in a relatively small number of genes spread throughout the genome. The accommodation of LD allows the use of all available genetic data, a desirable property when working with a modest number of dependent SNPs within candidate genes. CrypticIBDcheck is available from the Comprehensive R Archive Network (CRAN).

]]>
<![CDATA[Git can facilitate greater reproducibility and increased transparency in science]]> https://www.researchpad.co/product?articleinfo=5989da0bab0ee8fa60b77875

Background

Reproducibility is the hallmark of good science. Maintaining a high degree of transparency in scientific reporting is essential not just for gaining trust and credibility within the scientific community but also for facilitating the development of new ideas. Sharing data and computer code associated with publications is becoming increasingly common, motivated partly by data deposition requirements from journals and mandates from funders. Despite this increase in transparency, it is still difficult to reproduce or build upon the findings of most scientific publications without access to a more complete workflow.

Findings

Version control systems (VCS), which have long been used to maintain code repositories in the software industry, are now finding new applications in science. One such open source VCS, Git, provides a lightweight yet robust framework that is ideal for managing the full suite of research outputs such as datasets, statistical code, figures, lab notes, and manuscripts. For individual researchers, Git provides a powerful way to track and compare versions, retrace errors, and explore new approaches in a structured manner, all while maintaining a full audit trail. For larger collaborative efforts, Git and Git hosting services make it possible for everyone to work asynchronously and merge their contributions at any time, all the while maintaining a complete authorship trail. In this paper I provide an overview of Git along with use-cases that highlight how this tool can be leveraged to make science more reproducible and transparent, foster new collaborations, and support novel uses.

]]>
<![CDATA[A dedicated database system for handling multi-level data in systems biology]]> https://www.researchpad.co/product?articleinfo=5989dae0ab0ee8fa60bbba34

Background

Advances in high-throughput technologies have enabled extensive generation of multi-level omics data. These data are crucial for systems biology research, though they are complex, heterogeneous, highly dynamic, incomplete and distributed among public databases. This leads to difficulties in data accessibility and often results in errors when data are merged and integrated from varied resources. Therefore, integration and management of systems biological data remain very challenging.

Methods

To overcome this, we designed and developed a dedicated database system that addresses the vital issues in data management and thereby facilitates data integration, modeling and analysis in systems biology within a single database. In addition, a yeast data repository was implemented as an integrated database environment operated by the database system. Two applications were implemented to demonstrate the extensibility and utilization of the system. Both illustrate how the user can access the database via the web query function and implemented scripts. These scripts are specific to two sample cases: 1) detecting the pheromone pathway in protein interaction networks; and 2) finding metabolic reactions regulated by Snf1 kinase.

Results and conclusion

In this study we present the design of a database system which offers an extensible environment to efficiently capture the majority of biological entities and relations encountered in systems biology. Critical functions and control processes were designed and implemented to ensure consistent, efficient, secure and reliable transactions. The two sample cases on the yeast integrated data clearly demonstrate the value of a single database environment for systems biology research.

]]>
<![CDATA[Identifying large sets of unrelated individuals and unrelated markers]]> https://www.researchpad.co/product?articleinfo=5989db4aab0ee8fa60bd9dc3

Background

Genetic analyses in large sample populations are important for a better understanding of the variation between populations, for designing conservation programs, and for detecting rare mutations which may be risk factors for a variety of diseases, among other reasons. However, these analyses frequently assume that the participating individuals or animals are mutually unrelated, which may not be the case in large samples, leading to erroneous conclusions. In order to retain as much data as possible while minimizing the risk of false positives, it is useful to identify a large subset of relatively unrelated individuals in the population. This can be done using a heuristic for finding a large independent set of nodes in an undirected graph. We describe a fast randomized heuristic for this purpose. The same methodology can also be used for identifying a suitable set of markers for analyzing population stratification, and in other instances where a rapid heuristic for maximal independent sets in large graphs is needed.
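The graph heuristic described above can be sketched as follows (an illustrative randomized greedy procedure in Python; the authors' actual algorithm and C++ implementation are described in the Results):

```python
import random

# Nodes are individuals; an edge joins any pair documented or inferred
# to be related. A maximal independent set is then a large subset of
# mutually unrelated individuals.

def random_maximal_independent_set(adj, seed=None):
    rng = random.Random(seed)
    nodes = list(adj)
    rng.shuffle(nodes)               # random visit order varies the solution
    chosen = set()
    for v in nodes:
        if all(u not in chosen for u in adj[v]):
            chosen.add(v)            # v conflicts with no selected node
    return chosen

# "a"-"b" and "b"-"c" are related pairs; "d" is unrelated to everyone.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
s = random_maximal_independent_set(adj, seed=1)
```

Re-running with different seeds explores multiple maximal sets, possibly of equal cardinality, mirroring the multiple-solutions behaviour reported for FastIndep below.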

Results

We present FastIndep, a fast randomized heuristic algorithm for finding a maximal independent set of nodes in an arbitrary undirected graph, along with an efficient implementation in C++. On a 64-bit Linux or MacOS platform the execution time is a few minutes, even for a graph of several thousand nodes. The algorithm can discover multiple solutions of the same cardinality. FastIndep can be used to discover unlinked markers and unrelated individuals in populations.

Conclusions

The methods presented here provide a quick and efficient way of identifying sets of unrelated individuals in large populations and unlinked markers in marker panels. The C++ source code and instructions, along with utilities for generating the input files in the appropriate format, are available at http://taurus.ansci.iastate.edu/wiki/people/jabr/Joseph_Abraham.html

]]>
<![CDATA[RECOT: a tool for the coordinate transformation of next-generation sequencing reads for comparative genomics and transcriptomics]]> https://www.researchpad.co/product?articleinfo=5989da1dab0ee8fa60b7d8d5

Background

The whole-genome sequences of many non-model organisms have recently been determined. Using these genome sequences, next-generation sequencing based experiments such as RNA-seq and ChIP-seq have been performed, and comparisons of the experiments between related species have provided new knowledge about evolution and biological processes. Although these comparisons require transformation of the genome coordinates of the reads between the species, current software tools are not suitable for converting the massive numbers of reads to the corresponding coordinates of other species’ genomes.

Results

Here, we introduce a set of programs, called REad COordinate Transformer (RECOT), created to transform the coordinates of short reads obtained from the genome of a query species being studied to those of a comparison target species, after aligning the query and target gene/genome sequences. RECOT generates output in SAM format that can be viewed using recent genome browsers capable of displaying next-generation sequencing data.
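The core coordinate-transformation step can be pictured with a small sketch (the block format and function name here are hypothetical, not RECOT's actual interface):

```python
# Map a query-genome position to the target genome through gap-free
# aligned blocks, each given as (query_start, target_start, length).

def lift_position(blocks, qpos):
    for q0, t0, length in blocks:
        if q0 <= qpos < q0 + length:
            return t0 + (qpos - q0)   # same offset within the block
    return None                        # position falls in an unaligned gap

blocks = [(100, 500, 50), (200, 600, 30)]
target = lift_position(blocks, 110)    # 510
```

RECOT performs this kind of mapping for massive numbers of reads after aligning the query and target sequences, and emits SAM so the result can be browsed alongside other next-generation sequencing data.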

Conclusions

We demonstrate the usefulness of RECOT in comparing ChIP-seq results between two closely related fruit flies. The results indicate position changes of a transcription factor binding site caused by sequence polymorphisms at the binding site.

]]>
<![CDATA[MIA - A free and open source software for gray scale medical image analysis]]> https://www.researchpad.co/product?articleinfo=5989d9f7ab0ee8fa60b70c3f

Background

Gray scale images make the bulk of data in bio-medical image analysis, and hence, the main focus of many image processing tasks lies in the processing of these monochrome images. With ever improving acquisition devices, spatial and temporal image resolution increases, and data sets become very large.

Various image processing frameworks exist that make the development of new algorithms easy by using high-level programming languages or visual programming. These frameworks are also accessible to researchers with little or no background in software development, because they take care of otherwise complex tasks. Specifically, the management of working memory is handled automatically, usually at the price of requiring more of it. As a result, processing large data sets with these tools becomes increasingly difficult on workstation-class computers.

One alternative to using these high-level processing tools is the development of new algorithms in a language like C++, which gives the developer full control over how memory is handled; but the resulting workflow for prototyping new algorithms is rather time-intensive, and also not appropriate for a researcher with little or no knowledge of software development.

Another alternative is to use command line tools that run image processing tasks, use the hard disk to store intermediate results, and provide automation through shell scripts. Although not as convenient as, e.g., visual programming, this approach is still accessible to researchers without a background in computer science. However, only a few tools exist that provide this kind of processing interface; they are usually quite task-specific, and they do not provide a clear approach for shaping a new command line tool from a prototype shell script.

Results

The proposed framework, MIA, provides a combination of command line tools, plug-ins, and libraries that make it possible to run image processing tasks interactively in a command shell and to prototype by using the corresponding shell scripting language. Since the hard disk serves as the temporary storage, memory management is usually a non-issue in the prototyping phase. By using string-based descriptions for filters, optimizers, and the like, the transition from shell scripts to full-fledged programs implemented in C++ is also made easy. In addition, its design based on atomic plug-ins and single-task command line tools makes it easy to extend MIA, usually without the need to touch or recompile existing code.

Conclusion

In this article, we describe the general design of MIA, a general-purpose framework for gray scale image processing. We demonstrate the applicability of the software with example applications from three different research scenarios, namely motion compensation in myocardial perfusion imaging, the processing of high-resolution image data that arises in virtual anthropology, and retrospective analysis of treatment outcome in orthognathic surgery. With MIA, prototyping algorithms by using shell scripts that combine small, single-task command line tools is a viable alternative to the use of high-level languages, an approach that is especially useful when large data sets need to be processed.

]]>
<![CDATA[BioPatRec: A modular research platform for the control of artificial limbs based on pattern recognition algorithms]]> https://www.researchpad.co/product?articleinfo=5989d9e6ab0ee8fa60b6b3c4

Background

Processing and pattern recognition of myoelectric signals have been at the core of prosthetic control research in the last decade. Although most studies agree on reporting the accuracy of predicting predefined movements, there is a significant number of study-dependent variables that make high-resolution inter-study comparison practically impossible. As an effort to provide a common research platform for the development and evaluation of algorithms in prosthetic control, we introduce BioPatRec as open source software. BioPatRec allows the seamless implementation of a variety of algorithms in the fields of (1) signal processing; (2) feature selection and extraction; (3) pattern recognition; and (4) real-time control. Furthermore, since the platform is highly modular and customizable, researchers from different fields can seamlessly benchmark their algorithms by applying them in prosthetic control, without necessarily knowing how to obtain and process bioelectric signals, or how to produce and evaluate physically meaningful outputs.

Results

BioPatRec is demonstrated in this study by the implementation of a relatively new pattern recognition algorithm, namely Regulatory Feedback Networks (RFN). RFN produced results comparable to those of more sophisticated classifiers such as Linear Discriminant Analysis and Multi-Layer Perceptron. BioPatRec is released with these 3 fundamentally different classifiers, as well as all the necessary routines for the myoelectric control of a virtual hand, from data acquisition to real-time evaluation. All the required instructions for use and development are provided in the online project hosting platform, which includes issue tracking and an extensive “wiki”. This transparent implementation aims to facilitate collaboration and speed up adoption. Moreover, BioPatRec provides a publicly available repository of myoelectric signals that allows algorithms to be benchmarked on common data sets. This is particularly useful for researchers lacking data acquisition hardware, or with limited access to patients.

Conclusions

BioPatRec has been made openly and freely available in the hope of accelerating, through community contributions, the development of better algorithms that can potentially improve patients’ quality of life. It is currently used on 3 different continents and by researchers from different disciplines, thus proving to be a useful tool for development and collaboration.

]]>
<![CDATA[H3Africa: a tipping point for a revolution in bioinformatics, genomics and health research in Africa]]> https://www.researchpad.co/product?articleinfo=5989dae3ab0ee8fa60bbc739

Background

A multi-million dollar research initiative involving the National Institutes of Health (NIH), the Wellcome Trust and African scientists has been launched. The initiative is referred to as H3Africa, an acronym that stands for Human Heredity and Health in Africa. Here, we outline what this initiative is set to achieve and the latest commitments of the key players as of October 2013.

Findings

The initiative has so far been awarded over $74 million in research grants. During the first set of awards announced in 2012, the NIH granted $5 million a year for a period of five years, while the Wellcome Trust contributed at least $12 million over the period to the research consortium. This was in addition to the Wellcome Trust’s provision of administrative support, scientific consultation and advanced training, all in collaboration with the African Society for Human Genetics. In addition, during the second set of awards announced in October 2013, the NIH awarded the initiative 10 new grants of up to $17 million over the next four years.

Conclusions

H3Africa is poised to transform the face of research in genomics, bioinformatics and health in Africa. The capacity of African scientists will be enhanced through training and the better research facilities that will be acquired. Research collaborations between Africa and the West will grow and all stakeholders, including the funding partners, African scientists, scientists across the globe, physicians and patients will be the eventual winners.

]]>
<![CDATA[TrigNER: automatically optimized biomedical event trigger recognition on scientific documents]]> https://www.researchpad.co/product?articleinfo=5989dabfab0ee8fa60bb02ed

Background

Cellular events play a central role in the understanding of biological processes and functions, providing insight into both physiology and pathogenesis. Automatic extraction of mentions of such events from the literature represents an important contribution to the progress of the biomedical domain, allowing faster updating of existing knowledge. The identification of trigger words indicating an event is a very important step in the event extraction pipeline, since the following tasks rely on its output. This step presents various complex and unsolved challenges, namely the selection of informative features, the representation of the textual context, and the selection of a specific event type for a trigger word given this context.

Results

We propose TrigNER, a machine learning-based solution for biomedical event trigger recognition, which takes advantage of Conditional Random Fields (CRFs) with a rich feature set, including linguistic, orthographic, morphological, local-context and dependency-parsing features. Additionally, a completely configurable algorithm is used to automatically optimize the feature set and training parameters for each event type. Thus, it automatically selects the features that make a positive contribution and optimizes the CRF model order, n-gram sizes, vertex information and maximum hops for dependency-parsing features. The final output consists of various CRF models, each one optimized to the linguistic characteristics of one event type.

Conclusions

TrigNER was tested on the BioNLP 2009 shared task corpus, achieving a total F-measure of 62.7 and outperforming existing solutions on various event trigger types, namely gene expression, transcription, protein catabolism, phosphorylation and binding. The proposed solution allows researchers to easily apply complex and optimized techniques in the recognition of biomedical event triggers, making its application a simple routine task. We believe this work is an important contribution to the biomedical text mining community, contributing to improved and faster event recognition on scientific articles, and consequent hypothesis generation and knowledge discovery. This solution is freely available as open source at http://bioinformatics.ua.pt/trigner.

]]>
<![CDATA[Modular and configurable optimal sequence alignment software: Cola]]> https://www.researchpad.co/product?articleinfo=5989daa8ab0ee8fa60ba8464

Background

The fundamental challenge in optimally aligning homologous sequences is to define a scoring scheme that best reflects the underlying biological processes. Maximising the overall number of matches in the alignment does not always reflect the patterns by which nucleotides mutate. Efficiently implemented algorithms that can be parameterised to accommodate more complex non-linear scoring schemes are thus desirable.

Results

We present Cola, alignment software that implements different optimal alignment algorithms, also allowing for scoring contiguous matches of nucleotides in a nonlinear manner. The latter places more emphasis on short, highly conserved motifs, and less on the surrounding nucleotides, which can be more diverged. To illustrate the differences, we report results from aligning 14,100 sequences from 3' untranslated regions of human genes to 25 of their mammalian counterparts, where we found that a nonlinear scoring scheme is more consistent than a linear scheme in detecting short, conserved motifs.
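The nonlinear idea can be sketched in a few lines (an illustrative scoring function in Python, not Cola's actual scheme): score each run of k contiguous matches as k squared, so conserved motifs outweigh scattered matches.

```python
# Toy nonlinear scorer for two already-aligned sequences (with "-"
# for gaps): a run of k consecutive matches contributes k**2.

def nonlinear_match_score(aligned_a, aligned_b):
    score, run = 0, 0
    for x, y in zip(aligned_a, aligned_b):
        if x == y and x != "-":
            run += 1
        else:
            score += run * run       # close out the current match run
            run = 0
    return score + run * run         # flush the final run

# Runs of 3 and 2 matches: 9 + 4 = 13, versus 5 under linear scoring.
score = nonlinear_match_score("ACGTAC", "ACGAAC")
```

Under such a scheme a single 6-base conserved motif (36) is worth far more than six isolated matches (6), which is exactly the emphasis on short, highly conserved motifs described above.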

Conclusions

Cola is freely available under the LGPL from https://github.com/nedaz/cola.

]]>
<![CDATA[MOSAL: software tools for multiobjective sequence alignment]]> https://www.researchpad.co/product?articleinfo=5989dad8ab0ee8fa60bb8bfd

Multiobjective sequence alignment brings the advantage of providing a set of alignments that represent the trade-off between performing insertions/deletions and matching symbols from both sequences. Each of these alignments provides a potential explanation of the relationship between the sequences. We introduce MOSAL, a software tool that provides an open-source implementation and an on-line application for multiobjective pairwise sequence alignment.
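A minimal sketch of the multiobjective view (illustrative Python, not MOSAL's algorithm): given candidate alignments summarized as (matches, indels) pairs, the interesting set is the Pareto-optimal trade-offs.

```python
# Keep only candidates not dominated by another candidate that has
# at least as many matches and at most as many indels.

def pareto_front(candidates):
    front = []
    for m, i in candidates:
        dominated = any(m2 >= m and i2 <= i and (m2, i2) != (m, i)
                        for m2, i2 in candidates)
        if not dominated:
            front.append((m, i))
    return sorted(front)

# (10, 5) is dominated by (10, 4); (7, 2) is dominated by (8, 2).
front = pareto_front([(10, 4), (8, 2), (10, 5), (7, 2)])
```

Each surviving pair corresponds to one alignment on the trade-off curve, i.e. one potential explanation of how the two sequences are related.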

]]>
<![CDATA[ROVER variant caller: read-pair overlap considerate variant-calling software applied to PCR-based massively parallel sequencing datasets]]> https://www.researchpad.co/product?articleinfo=5989da96ab0ee8fa60ba207d

Background

We recently described Hi-Plex, a highly multiplexed PCR-based target-enrichment system for massively parallel sequencing (MPS), which allows the uniform definition of library size so that subsequent paired-end sequencing can achieve complete overlap of read pairs. Variant calling from Hi-Plex-derived datasets can thus rely on the identification of variants appearing in both reads of read-pairs, permitting stringent filtering of sequencing chemistry-induced errors. These principles underlie the ROVER software (derived from Read Overlap PCR-MPS variant caller), which we have recently used to report the screening for genetic mutations in the breast cancer predisposition gene PALB2. Here, we describe the algorithms underlying ROVER and its usage.

Results

ROVER enables users to quickly and accurately identify genetic variants from PCR-targeted, overlapping paired-end MPS datasets. The open-source availability of the software and threshold tailorability enables broad access for a range of PCR-MPS users.

Methods

ROVER is implemented in Python and runs on all popular POSIX-like operating systems (Linux, OS X). The software accepts a tab-delimited text file listing the coordinates of the target-specific primers used for targeted enrichment, based on a specified genome build. It also accepts aligned sequence files resulting from mapping to the same genome build. ROVER identifies the amplicon a given read-pair represents and removes the primer sequences by using the mapping coordinates and primer coordinates. It considers overlapping read-pairs with respect to the primer-intervening sequence. Only when a variant is observed in both reads of a read-pair does the signal contribute to a tally of read-pairs containing or not containing the variant. A user-defined threshold sets the minimum number, and proportion, of read-pairs in which a variant must be observed for a ‘call’ to be made. ROVER also reports the depth of coverage across amplicons to facilitate the identification of any regions that may require further screening.
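The calling rule described above can be condensed into a sketch (parameter names and the read-pair representation here are illustrative, not ROVER's actual code):

```python
# A variant contributes support only when it is observed in BOTH reads
# of a pair; a call then requires a minimum count and a minimum
# proportion of supporting read-pairs.

def call_variant(read_pairs, variant, min_pairs=2, min_prop=0.05):
    support = sum(1 for r1, r2 in read_pairs
                  if variant in r1 and variant in r2)
    return support >= min_pairs and support / len(read_pairs) >= min_prop

pairs = [({"chr1:100A>T"}, {"chr1:100A>T"}),   # seen in both reads
         ({"chr1:100A>T"}, set()),             # likely chemistry error
         (set(), set())]
called = call_variant(pairs, "chr1:100A>T", min_pairs=1, min_prop=0.3)
```

The pair where the variant appears in only one read contributes nothing, which is how complete read-pair overlap filters out chemistry-induced errors.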

Conclusions

ROVER can facilitate rapid and accurate genetic variant calling for a broad range of PCR-MPS users.

]]>
<![CDATA[biobambam: tools for read pair collation based algorithms on BAM files]]> https://www.researchpad.co/product?articleinfo=5989db3dab0ee8fa60bd5639

Background

Sequence alignment data is often ordered by coordinate (the id of the reference sequence plus the position on the sequence where the fragment was mapped) when stored in BAM files, as this simplifies the extraction of variants between the mapped data and the reference, or of variants within the mapped data. In this order, paired reads are usually separated in the file, which complicates other applications, such as duplicate marking or conversion to the FastQ format, which require access to the full information of the pairs.

Results

In this paper we introduce biobambam, a set of tools based on the efficient collation of alignments in BAM files by read name. The employed collation algorithm avoids time- and space-consuming sorting of alignments by read name where this is possible without using more than a specified amount of main memory. Using this algorithm, tasks like duplicate marking in BAM files and conversion of BAM files to the FastQ format can be performed very efficiently with limited resources. We also make the collation algorithm available in the form of an API for other projects. This API is part of the libmaus package.
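The collation idea can be illustrated with a toy sketch (Python; biobambam itself is far more careful with memory and avoids a full sort where possible):

```python
from itertools import groupby

# Coordinate-sorted BAM files interleave mates from different pairs;
# collation by read name reunites each pair. Records here are just
# (read_name, payload) tuples standing in for alignments.

def collate_pairs(records):
    for name, group in groupby(sorted(records), key=lambda r: r[0]):
        yield name, [payload for _, payload in group]

bam_like = [("r1", "mate1"), ("r2", "mate1"),
            ("r1", "mate2"), ("r2", "mate2")]
collated = dict(collate_pairs(bam_like))
```

Once mates are reunited, duplicate marking and FastQ conversion follow naturally, which is why collation is the shared core of the biobambam tools.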

Conclusions

In comparison with previous approaches to problems involving the collation of alignments by read name, such as BAM-to-FastQ conversion or duplicate marking utilities, our approach can often perform an equivalent task more efficiently in terms of the required main memory and run time. Our BAM-to-FastQ conversion is faster than all widely known alternatives, including Picard and bamUtil. Our duplicate marking is about as fast as the closest competitor, bamUtil, for small data sets and faster than all known alternatives on large and complex data sets.

]]>
<![CDATA[The non-negative matrix factorization toolbox for biological data mining]]> https://www.researchpad.co/product?articleinfo=5989db4aab0ee8fa60bd9e86

Background

Non-negative matrix factorization (NMF) has been introduced as an important method for mining biological data. Though packages implementing NMF currently exist in R and other programming languages, they either provide only a few optimization algorithms or focus on a specific application field. A complete NMF package enabling the bioinformatics community to perform various data mining tasks on biological data does not yet exist.

Results

We provide a convenient MATLAB toolbox containing both the implementations of various NMF techniques and a variety of NMF-based data mining approaches for analyzing biological data. Data mining approaches implemented within the toolbox include data clustering and bi-clustering, feature extraction and selection, sample classification, missing values imputation, data visualization, and statistical comparison.
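For readers new to NMF, the classic Lee-Seung multiplicative-update algorithm (one of the many variants such a toolbox covers; this sketch is plain Python, whereas the toolbox itself is MATLAB) factors a non-negative matrix V into non-negative W and H:

```python
# Factor V (m x n) as W (m x k) @ H (k x n) with all entries >= 0,
# using multiplicative updates: H *= (W'V)/(W'WH), W *= (VH')/(WHH').

def T(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = T(B)
    return [[sum(x * y for x, y in zip(row, col)) for col in Bt]
            for row in A]

def nmf(V, W, H, iters=500, eps=1e-9):
    for _ in range(iters):
        num, den = matmul(T(W), V), matmul(matmul(T(W), W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(len(H[0]))] for i in range(len(H))]
        num, den = matmul(V, T(H)), matmul(W, matmul(H, T(H)))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(len(W[0]))] for i in range(len(W))]
    return W, H

# A rank-1 example: V is exactly an outer product, so W @ H recovers it.
V = [[1.0, 2.0], [2.0, 4.0]]
W, H = nmf(V, [[1.0], [1.0]], [[1.0, 1.0]])
```

The non-negativity of W and H is what makes the factors interpretable as additive parts, e.g. metagenes and their expression patterns, which underlies the clustering and feature-extraction uses listed above.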

Conclusions

A series of analyses, such as molecular pattern discovery, biological process identification, dimension reduction, disease prediction, visualization, and statistical comparison, can be performed using this toolbox.

]]>
<![CDATA[SVAw - a web-based application tool for automated surrogate variable analysis of gene expression studies]]> https://www.researchpad.co/product?articleinfo=5989daf1ab0ee8fa60bc15f9

Background

Surrogate variable analysis (SVA) is a powerful method to identify, estimate, and utilize the components of gene expression heterogeneity due to unknown and/or unmeasured technical, genetic, environmental, or demographic factors. These sources of heterogeneity are common in gene expression studies, and failing to incorporate them into the analysis can obscure results. Using SVA increases the biological accuracy and reproducibility of gene expression studies by identifying these sources of heterogeneity and correctly accounting for them in the analysis.

Results

Here we have developed a web application called SVAw (Surrogate Variable Analysis Web app) that provides a user-friendly interface for SVA analyses of genome-wide expression studies. The software has been developed on the basis of the open-source Bioconductor SVA package. In our software, we have extended the SVA program’s functionality in three aspects: (i) SVAw performs a fully automated and user-friendly analysis workflow; (ii) it calculates probe/gene statistics for both pre- and post-SVA analysis and provides a table of results for the regression of gene expression on the primary variable of interest, before and after correcting for surrogate variables; and (iii) it generates a comprehensive report file, including a graphical comparison of the outcome, for the user.

Conclusions

SVAw is a freely accessible web-based solution for the surrogate variable analysis of high-throughput datasets that facilitates the removal of unwanted and unknown sources of variation. It is freely available for use at http://psychiatry.igm.jhmi.edu/sva. The executable packages for both the web and the standalone application, and the instructions for installation, can be downloaded from our web site.

]]>
<![CDATA[BatTool: an R package with GUI for assessing the effect of White-nose syndrome and other take events on Myotis spp. of bats]]> https://www.researchpad.co/product?articleinfo=5989da5cab0ee8fa60b9019c

Background

Myotis species of bats such as the Indiana Bat and Little Brown Bat are facing population declines because of White-nose syndrome (WNS). These species also face threats from anthropogenic activities such as wind energy development. Population models may be used to provide insights into the threats facing these species. We developed a population model, BatTool, as an R package to help decision makers and natural resource managers examine factors influencing the dynamics of these species. The R package includes two components: 1) deterministic and stochastic models that are accessible from the command line, and 2) a graphical user interface (GUI).
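As a toy illustration of why such models are useful (all parameter values below are hypothetical, and this is a bare sketch rather than BatTool's actual model): yearly survival, an extra WNS mortality factor, and a fixed take combine to drive the population trajectory.

```python
# Project a population forward: survivors each year are reduced by a
# WNS mortality factor, then a fixed anthropogenic take is subtracted.

def project(n0, years, survival=0.8, wns_factor=0.9, take=50):
    n, trajectory = n0, [n0]
    for _ in range(years):
        n = max(0.0, n * survival * wns_factor - take)
        trajectory.append(n)
    return trajectory

traj = project(10_000, 5)   # steadily declining under these assumptions
```

BatTool's stochastic component would draw such rates from distributions instead of fixing them, and the GUI exposes the same machinery to users who do not program in R.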

Results

BatTool is an R package allowing natural resource managers and decision makers to understand Myotis spp. population dynamics. Through the use of a GUI, the model allows users to understand how WNS and other take events may affect the population.

The results are saved both graphically and as data files. Additionally, R-savvy users may access the population functions through the command line and reuse the code as part of future research. This R package could also be used as part of a population dynamics or wildlife management course.

Conclusions

BatTool provides access to a Myotis spp. population model. This tool can help natural resource managers and decision makers with the Endangered Species Act deliberations for these species and with issuing take permits as part of regulatory decision making. The tool is available online as part of this publication.

]]>
<![CDATA[Combining de novo and reference-guided assembly with scaffold_builder]]> https://www.researchpad.co/product?articleinfo=5989da75ab0ee8fa60b96638

Genome sequencing has become routine; however, genome assembly remains a challenge despite the computational advances of the last decade. In particular, the abundance of repeat elements in genomes makes it difficult to assemble them into a single complete sequence. Identical repeats shorter than the average read length can generally be assembled without issue, but longer repeats such as ribosomal RNA operons cannot be accurately assembled using existing tools. The application Scaffold_builder was designed to generate scaffolds – super-contigs of sequences joined by N-bases – based on similarity to a closely related reference sequence. This is independent of mate-pair information and can be used complementarily for genome assembly, e.g. when mate-pairs are not available or have already been exploited. Scaffold_builder was evaluated using simulated pyrosequencing reads of the bacterial genomes Escherichia coli 042, Lactobacillus salivarius UCC118 and Salmonella enterica subsp. enterica serovar Typhi str. P-stx-12. Moreover, we sequenced two genomes, Salmonella enterica serovar Typhimurium LT2 G455 and Salmonella enterica serovar Typhimurium SDT1291, and show that Scaffold_builder decreases the number of contig sequences by 53% while more than doubling their average length. Scaffold_builder is written in Python and is available at http://edwards.sdsu.edu/scaffold_builder. A web-based implementation is additionally provided to allow users to submit a reference genome and a set of contigs to be scaffolded.
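
The general reference-guided scaffolding idea can be sketched in a few lines: order the contigs by their mapped position on the reference and join them with runs of N sized from the estimated reference gap. The function below is an illustration of that general idea under those assumptions, not Scaffold_builder's actual implementation, and it presumes the contig-to-reference placements have already been computed by an aligner.

```python
def build_scaffold(contigs, placements, default_gap=100):
    """Join contigs into a single scaffold using reference coordinates.

    contigs:    dict mapping contig name -> sequence
    placements: list of (name, ref_start) pairs from an alignment to a
                closely related reference (assumed precomputed)
    """
    ordered = sorted(placements, key=lambda p: p[1])   # reference order
    pieces = []
    prev_end = None
    for name, ref_start in ordered:
        seq = contigs[name]
        if prev_end is not None:
            gap = ref_start - prev_end                 # estimated gap size
            pieces.append("N" * (gap if gap > 0 else default_gap))
        pieces.append(seq)
        prev_end = ref_start + len(seq)
    return "".join(pieces)

# Two toy contigs mapping to reference positions 0 and 20
contigs = {"c1": "ATGCATGC", "c2": "GGGTTTAA"}
placements = [("c2", 20), ("c1", 0)]
scaffold = build_scaffold(contigs, placements)
print(len(scaffold))  # 8 + 12 N's + 8 = 28
```

Overlapping or out-of-order placements are the hard cases a real tool must resolve; here they simply fall back to a default-sized gap.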

]]>
<![CDATA[PFClust: an optimised implementation of a parameter-free clustering algorithm]]> https://www.researchpad.co/product?articleinfo=5989da01ab0ee8fa60b74500

Background

A well-known problem in cluster analysis is finding an optimal number of clusters reflecting the inherent structure of the data. PFClust is a partitioning-based clustering algorithm capable, unlike many widely-used clustering algorithms, of automatically proposing an optimal number of clusters for the data.
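
The core difficulty, choosing the number of clusters from the data itself, can be illustrated with a generic sketch: scan candidate values of k and keep the one with the best internal quality score. The toy code below uses plain 1-D k-means and the average silhouette width as that score; this is a stand-in criterion for illustration, not PFClust's actual algorithm.

```python
import numpy as np

def kmeans_1d(x, k, iters=100):
    """Toy 1-D k-means with deterministic quantile initialisation."""
    centers = np.quantile(x, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[labels == j].mean() if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

def mean_silhouette(x, labels):
    """Average silhouette width: higher means better-separated clusters."""
    scores = []
    for i in range(len(x)):
        same = x[labels == labels[i]]
        a = np.abs(x[i] - same).sum() / max(len(same) - 1, 1)
        b = min(np.abs(x[i] - x[labels == j]).mean()
                for j in np.unique(labels) if j != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Three well-separated 1-D clusters; scan k and keep the best score.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(m, 0.3, 30) for m in (0.0, 5.0, 10.0)])
best_k = max(range(2, 6), key=lambda k: mean_silhouette(x, kmeans_1d(x, k)))
print(best_k)
```

A parameter-free algorithm automates exactly this kind of decision so the user never supplies k by hand.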

Results

The results of tests on various types of data showed that PFClust can discover clusters of arbitrary shapes, sizes and densities. The previous implementation of the algorithm had already been used successfully to cluster large macromolecular structures and small drug-like compounds. We have greatly improved the algorithm through a more efficient implementation, which enables PFClust to process large data sets acceptably fast.

Conclusions

In this paper we present a new optimized implementation of the PFClust algorithm that runs considerably faster than the original.

]]>
<![CDATA[Inmembrane, a bioinformatic workflow for annotation of bacterial cell-surface proteomes]]> https://www.researchpad.co/product?articleinfo=5989daf8ab0ee8fa60bc3c1c

Background

The annotation of surface exposed bacterial membrane proteins is an important step in interpretation and validation of proteomic experiments. In particular, proteins detected by cell surface protease shaving experiments can indicate exposed regions of membrane proteins that may contain antigenic determinants or constitute vaccine targets in pathogenic bacteria.

Results

Inmembrane is a tool to predict membrane proteins with surface-exposed regions of polypeptide in sets of bacterial protein sequences. We have re-implemented a protocol for Gram-positive bacterial proteomes and developed a new protocol for Gram-negative bacteria, both of which interface with multiple predictors of subcellular localization and membrane protein topology. Through the use of a modern scripting language, inmembrane provides an accessible code base and extensible architecture that is amenable to modification for related sequence annotation tasks.
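
The workflow pattern of such a pipeline is to run several independent predictors per protein and then combine their outputs with a small set of rules. The sketch below illustrates that aggregation step only; the predictor names, result fields and decision rules are assumptions for illustration, not inmembrane's real interfaces or logic.

```python
def annotate_surface_exposure(protein_id, predictions):
    """Combine stubbed predictor outputs into one annotation.

    predictions: dict mapping predictor name -> result, e.g.
        {"signalp": True, "tmhmm_helices": 3, "lipop": False}
    Rules are applied in priority order, most specific first.
    """
    if predictions.get("tmhmm_helices", 0) > 0:
        # Predicted transmembrane helices: loops may be surface exposed
        return "membrane, potentially surface exposed"
    if predictions.get("lipop"):
        # Lipoprotein signal peptide detected
        return "lipoprotein"
    if predictions.get("signalp"):
        # Classical signal peptide, no membrane anchor
        return "secreted"
    return "cytoplasmic"

print(annotate_surface_exposure("YfiO", {"signalp": True, "lipop": True}))
# lipoprotein
```

Keeping each predictor behind a small adapter like this is what makes it cheap to swap a local binary for a web-based query, or to add a new predictor.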

Conclusions

Inmembrane easily integrates predictions from both local binaries and web-based queries to help gain an overview of likely surface-exposed proteins in a bacterial proteome. The program is hosted on the Github repository http://github.com/boscoh/inmembrane.

]]>