ResearchPad - hidden-markov-models https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks
<![CDATA[ToyArchitecture: Unsupervised learning of interpretable models of the environment]]> https://www.researchpad.co/article/elastic_article_15730

Research in Artificial Intelligence (AI) has focused mostly on two extremes: either on small improvements in narrow AI domains, or on universal theoretical frameworks which are often uncomputable, or lack practical implementations. In this paper we attempt to follow a big picture view while also providing a particular theory and its implementation to present a novel, purposely simple, and interpretable hierarchical architecture. This architecture incorporates the unsupervised learning of a model of the environment, learning the influence of one’s own actions, model-based reinforcement learning, hierarchical planning, and symbolic/sub-symbolic integration in general. The learned model is stored in the form of hierarchical representations which are increasingly more abstract, but can retain details when needed. We demonstrate the universality of the architecture by testing it on a series of diverse environments ranging from audio/visual compression to discrete and continuous action spaces, to learning disentangled representations.

]]>
<![CDATA[Predicting change: Approximate inference under explicit representation of temporal structure in changing environments]]> https://www.researchpad.co/article/5c5ca27ed5eed0c48441e4cc

In our daily lives, the timing of our actions plays an essential role when we navigate the complex everyday environment. It remains an open question, though, how representations of the temporal structure of the world influence our behavior. Here we propose a probabilistic model with an explicit representation of state durations which may provide novel insights into how the brain predicts upcoming changes. We illustrate several properties of the behavioral model using a standard reversal learning design and compare its task performance to standard reinforcement learning models. Furthermore, using experimental data, we demonstrate how the model can be applied to identify participants’ beliefs about the latent temporal task structure. We found that roughly one quarter of participants seem to have learned the latent temporal structure and used it to anticipate changes, whereas the remaining participants’ behavior did not show signs of anticipatory responses, suggesting a lack of precise temporal expectations. We expect that the introduced behavioral model will allow, in future studies, for a systematic investigation of how participants learn the underlying temporal structure of task environments and how these representations shape behavior.

]]>
<![CDATA[Utilizing longitudinal microbiome taxonomic profiles to predict food allergy via Long Short-Term Memory networks]]> https://www.researchpad.co/article/5c61e8e9d5eed0c48496f3af

Food allergy is usually difficult to diagnose in early life, and the inability to diagnose patients with atopic diseases at an early age may lead to severe complications. Numerous studies have suggested an association between the infant gut microbiome and development of allergy. In this work, we investigated the capacity of Long Short-Term Memory (LSTM) networks to predict food allergies in early life (0-3 years) from subjects’ longitudinal gut microbiome profiles. Using the DIABIMMUNE dataset, we show an increase in predictive power using our model compared to a Hidden Markov Model, Multi-Layer Perceptron Neural Network, Support Vector Machine, Random Forest, and LASSO regression. We further evaluated whether the training of LSTM networks benefits from reduced representations of microbial features. We considered a sparse autoencoder for extraction of potential latent representations, in addition to standard feature selection procedures based on Minimum Redundancy Maximum Relevance (mRMR) and variance, prior to the training of LSTM networks. The comprehensive evaluation reveals that LSTM networks with the mRMR-selected features achieve significantly better performance than the other tested machine learning models.

]]>
<![CDATA[Triplet-pore structure of a highly divergent TOM complex of hydrogenosomes in Trichomonas vaginalis]]> https://www.researchpad.co/article/5c390bb5d5eed0c48491df0f

Mitochondria originated from proteobacterial endosymbionts, and their transition to organelles was tightly linked to establishment of the protein import pathways. The initial import of most proteins is mediated by the translocase of the outer membrane (TOM). Although TOM is common to all forms of mitochondria, an unexpected diversity of subunits between eukaryotic lineages has been predicted. However, experimental knowledge is limited to a few organisms, and so far, it remains unsettled whether the triplet-pore or the twin-pore structure is the generic form of TOM complex. Here, we analysed the TOM complex in hydrogenosomes, a metabolically specialised anaerobic form of mitochondria found in the excavate Trichomonas vaginalis. We demonstrate that the highly divergent β-barrel T. vaginalis TOM (TvTom)40-2 forms a translocation channel to conduct hydrogenosomal protein import. TvTom40-2 is present in high molecular weight complexes, and their analysis revealed the presence of four tail-anchored (TA) proteins. Two of them, Tom36 and Tom46, with heat shock protein (Hsp)20 and tetratricopeptide repeat (TPR) domains, can bind hydrogenosomal preproteins and most likely function as receptors. A third subunit, Tom22-like protein, has a short cis domain and a conserved Tom22 transmembrane segment but lacks a trans domain. The fourth protein, hydrogenosomal outer membrane protein 19 (Homp19) has no known homology. Furthermore, our data indicate that TvTOM is associated with sorting and assembly machinery (Sam)50 that is involved in β-barrel assembly. Visualisation of TvTOM by electron microscopy revealed that it forms three pores and has an unconventional skull-like shape. Although TvTOM seems to lack Tom7, our phylogenetic profiling predicted Tom7 in free-living excavates. 
Collectively, our results suggest that the triplet-pore TOM complex, composed of three conserved subunits, was present in the last common eukaryotic ancestor (LECA), while receptors responsible for substrate binding evolved independently in different eukaryotic lineages.

]]>
<![CDATA[Segmenting accelerometer data from daily life with unsupervised machine learning]]> https://www.researchpad.co/article/5c3fa5d4d5eed0c484ca916d

Purpose

Accelerometers are increasingly used to obtain valuable descriptors of physical activity for health research. The cut-points approach to segment accelerometer data is widely used in physical activity research but requires resource expensive calibration studies and does not make it easy to explore the information that can be gained for a variety of raw data metrics. To address these limitations, we present a data-driven approach for segmenting and clustering the accelerometer data using unsupervised machine learning.

Methods

The data came from 500 fourteen-year-old participants in the Millennium Cohort Study who wore an accelerometer (GENEActiv) on their wrist on one weekday and one weekend day. A Hidden Semi-Markov Model (HSMM), configured to identify a maximum of ten behavioral states from five-second averaged acceleration, with and without the addition of x, y, and z-angles, was used for segmenting and clustering the data. A cut-points approach was used for comparison.

Results

To reach 95% explained variance, time spent in behavioral states required eight principal components with angle metrics and five without; in comparison, four components were identified with the cut-points approach. In the HSMM with acceleration and angle as input, the distributions for acceleration in the states showed groupings similar to the cut-points categories, while more variety was seen in the distribution of angles.

Conclusion

Our unsupervised classification approach learns a construct of human behavior based on the data it observes, without the need for resource expensive calibration studies, has the ability to combine multiple data metrics, and offers a higher dimensional description of physical behavior. States are interpretable from the distributions of observations and by their duration.

]]>
<![CDATA[Macrophage activation by IFN-γ triggers restriction of phagosomal copper from intracellular pathogens]]> https://www.researchpad.co/article/5bfc623ed5eed0c484ec7a25

Copper toxicity and copper limitation can both be effective host defense mechanisms against pathogens. Tolerance of high copper by fungi makes toxicity as a defense mechanism largely ineffective against fungal pathogens. A forward genetic screen for Histoplasma capsulatum mutant yeasts unable to replicate within macrophages showed the Ctr3 copper transporter is required for intramacrophage proliferation. Ctr3 mediates copper uptake and is required for growth in low copper. Transcription of the CTR3 gene is induced by differentiation of H. capsulatum into pathogenic yeasts and by low available copper, but not decreased iron. Low expression of a CTR3 transcriptional reporter by intracellular yeasts implies that phagosomes of non-activated macrophages have moderate copper levels. This is further supported by the replication of Ctr3-deficient yeasts within the phagosome of non-activated macrophages. However, IFN-γ activation of phagocytes causes restriction of phagosomal copper as shown by upregulation of the CTR3 transcriptional reporter and by the failure of Ctr3-deficient yeasts, but not Ctr3 expressing yeasts, to proliferate within these macrophages. Accordingly, in a respiratory model of histoplasmosis, Ctr3-deficient yeasts are fully virulent during phases of the innate immune response but are attenuated after the onset of adaptive immunity. Thus, while technical limitations prevent direct measurement of phagosomal copper concentrations and copper-independent factors can influence gene expression, both the CTR3 promoter induction and the attenuation of Ctr3-deficient yeasts indicate activation of macrophages switches the phagosome from a copper-replete to a copper-depleted environment, forcing H. capsulatum reliance on Ctr3 for copper acquisition.

]]>
<![CDATA[Rosetta FunFolDes – A general framework for the computational design of functional proteins]]> https://www.researchpad.co/article/5bfc6223d5eed0c484ec6c7f

The robust computational design of functional proteins has the potential to deeply impact translational research and broaden our understanding of the determinants of protein function and stability. The low success rates of computational design protocols, and the extensive in vitro optimization often required, highlight the challenge of designing proteins that perform essential biochemical functions, such as binding or catalysis. One of the simplest approaches for the design of function is to adopt functional motifs in naturally occurring proteins and transplant them to computationally designed proteins. The structural complexity of the functional motif largely determines how readily one can find host protein structures that are “designable”, meaning that they are likely to present the functional motif in the desired conformation. One promising route to enhance the “designability” of protein structures is to allow backbone flexibility. Here, we present a computational approach that couples conformational folding with sequence design to embed functional motifs into heterologous proteins: Rosetta Functional Folding and Design (FunFolDes). We performed extensive computational benchmarks, where we observed that the enforcement of functional requirements resulted in designs distant from the global energetic minimum of the protein, an observation consistent with several experimental studies that have revealed function-stability tradeoffs. To test the design capabilities of FunFolDes we transplanted two viral epitopes into distant structural templates including one de novo “functionless” fold, which represent two typical challenges where the designability problem arises. The designed proteins were experimentally characterized and showed high binding affinities to monoclonal antibodies, making them valuable candidates for vaccine design endeavors. Overall, we present an accessible strategy to repurpose old protein folds for new functions.
This may lead to important improvements in the computational design of proteins with structurally complex functional sites that can perform elaborate biochemical functions related to binding and catalysis.

]]>
<![CDATA[Tangled history of a multigene family: The evolution of ISOPENTENYLTRANSFERASE genes]]> https://www.researchpad.co/article/5b6da1ae463d7e4dccc5faea

ISOPENTENYLTRANSFERASE (IPT) genes play important roles in the initial steps of cytokinin synthesis, exist in plants and pathogenic bacteria, and form a multigene family in plants. Protein domain searches revealed that bacterial and plant IPT proteins were assigned to different protein domain families in the Pfam database, namely the Pfam IPT (IPTPfam) and Pfam IPPT (IPPTPfam) families, which are closely related within the P-loop NTPase clan. To understand the origin and evolution of the genes, a species matrix was assembled across the tree of life and intensively in plant lineages. The IPTPfam domain was found in only a few bacterial lineages, whereas IPPTPfam is common except in Archaea and Mycoplasma bacteria. The bacterial IPPTPfam domain miaA genes were shown to be ancestral to eukaryotic IPPTPfam domain genes. Plant IPTs diversified into class I tRNA-IPTs, class II tRNA-IPTs, and Adenosine-phosphate IPTs; the class I tRNA-IPTs appeared to represent direct successors of miaA genes and were found in all plant genomes, whereas class II tRNA-IPTs originated from eukaryotic genes and were found in prasinophyte algae and in euphyllophytes. Adenosine-phosphate IPTs were found only in angiosperms. Gene duplications resulted in gene redundancies with ubiquitous expression or diversification in expression. In conclusion, it is shown that IPT genes have a complex history prior to the protein family split, and might have experienced losses or HGTs, and gene duplications that are likely to be correlated with the rise in morphological complexity involved in fine-tuning cytokinin production.

]]>
<![CDATA[SAFlex: A structural alphabet extension to integrate protein structural flexibility and missing data information]]> https://www.researchpad.co/article/5b4a196a463d7e428027f8b1

In this paper, we describe SAFlex (Structural Alphabet Flexibility), an extension of an existing structural alphabet (HMM-SA), designed to better exploit the growing body of three-dimensional protein structure information by encoding protein conformations even when residues are missing or uncertain. An SA aims to reduce the complexity of three-dimensional protein conformations, and of their analysis and comparison, by simplifying any conformation into a series of structural letters. Our methodology presents several novelties. Firstly, it can account for the encoding uncertainty by providing a wide range of encoding options: the maximum a posteriori, the marginal posterior distribution, and the effective number of letters at each given position. Secondly, our new algorithm deals with the missing data in the protein structure files (concerning more than 75% of the proteins from the Protein Data Bank) in a rigorous probabilistic framework. Thirdly, SAFlex is able to encode and to build a consensus encoding from different replicates of a single protein, such as several homomer chains. This allows localizing structural differences between different chains and detecting structural variability, which is essential for protein flexibility identification. These improvements are illustrated on different proteins, such as the crystal structure of a eukaryotic small heat shock protein. They show promise for exploring the growing volume of redundant protein structure data and for obtaining useful quantification of protein flexibility.

]]>
<![CDATA[View-Invariant Visuomotor Processing in Computational Mirror Neuron System for Humanoid]]> https://www.researchpad.co/article/5989da1cab0ee8fa60b7d5bf

Mirror neurons are visuo-motor neurons found in primates and thought to be significant for imitation learning. The proposition that mirror neurons result from associative learning while the neonate observes their own actions has received noteworthy empirical support. Self-exploration is regarded as a procedure by which infants become perceptually observant of their own body and engage in perceptual communication with themselves. We assume that a crude sense of self is the prerequisite for social interaction. However, the contribution of mirror neurons to encoding the perspective from which the motor acts of others are seen has not been addressed in relation to humanoid robots. In this paper we present a computational model for the development of a mirror neuron system (MNS) for humanoids, based on the hypothesis that infants acquire the MNS by sensorimotor associative learning through self-exploration capable of sustaining early imitation skills. The purpose of our proposed model is to take into account the view-dependency of neurons as a probable outcome of the associative connectivity between motor and visual information. In our experiment, a humanoid robot stands in front of a mirror (represented through a self-image captured by a camera) in order to obtain the associative relationship between its own motor-generated actions and its own visual body image. In the learning process, the network first forms a mapping from each motor representation onto the visual representation from the self-exploratory perspective. Afterwards, the representation of the motor commands is learned in association with all possible visual perspectives. The complete architecture was evaluated in simulation experiments performed on the DARwIn-OP humanoid robot.

]]>
<![CDATA[A Hidden Markov Model Approach for Simultaneously Estimating Local Ancestry and Admixture Time Using Next Generation Sequence Data in Samples of Arbitrary Ploidy]]> https://www.researchpad.co/article/5989db54ab0ee8fa60bdd131

Admixture, the mixing of genomes from divergent populations, is increasingly appreciated as a central process in evolution. To characterize and quantify patterns of admixture across the genome, a number of methods have been developed for local ancestry inference. However, existing approaches have a number of shortcomings. First, all local ancestry inference methods require some prior assumption about the expected ancestry tract lengths. Second, existing methods generally require genotypes, which are not feasible to obtain for many next-generation sequencing projects. Third, many methods assume samples are diploid; however, a wide variety of sequencing applications will fail to meet this assumption. To address these issues, we introduce a novel hidden Markov model for estimating local ancestry that models the read pileup data rather than genotypes, is generalized to arbitrary ploidy, and can estimate the time since admixture during local ancestry inference. We demonstrate that our method can simultaneously estimate the time since admixture and local ancestry with good accuracy, and that it performs well on samples of high ploidy, i.e. 100 or more chromosomes. As this method is very general, we expect it will be useful for local ancestry inference in a wider variety of populations than has previously been possible. We then applied our method to pooled sequencing data derived from populations of Drosophila melanogaster on an ancestry cline on the east coast of North America. We find that regions of local recombination rates are negatively correlated with the proportion of African ancestry, suggesting that selection against foreign ancestry is the least efficient in low recombination regions. Finally, we show that clinal outlier loci are enriched for genes associated with gene regulatory functions, consistent with a role of regulatory evolution in ecological adaptation of admixed D. melanogaster populations.
Our results illustrate the potential of local ancestry inference for elucidating fundamental evolutionary processes.

]]>
<![CDATA[Metagenome and Metatranscriptome Analyses Using Protein Family Profiles]]> https://www.researchpad.co/article/5989da31ab0ee8fa60b849a6

Analyses of metagenome data (MG) and metatranscriptome data (MT) are often challenged by a paucity of complete reference genome sequences and the uneven/low sequencing depth of the constituent organisms in the microbial community, which respectively limit the power of reference-based alignment and de novo sequence assembly. These limitations make accurate protein family classification and abundance estimation challenging, which in turn hampers downstream analyses such as abundance profiling of metabolic pathways, identification of differentially encoded/expressed genes, and de novo reconstruction of complete gene and protein sequences from the protein family of interest. The profile hidden Markov model (HMM) framework enables the construction of very useful probabilistic models for protein families that allow for accurate modeling of position-specific matches, insertions, and deletions. We present a novel homology detection algorithm that integrates a banded Viterbi algorithm for profile HMM parsing with an iterative simultaneous alignment and assembly computational framework. The algorithm searches a given profile HMM of a protein family against a database of fragmentary MG/MT sequencing data and simultaneously assembles complete or near-complete gene and protein sequences of the protein family. The resulting program, HMM-GRASPx, demonstrates superior performance in aligning and assembling homologs when benchmarked on both simulated marine MG and real human saliva MG datasets. On real supragingival plaque and stool MG datasets that were generated from healthy individuals, HMM-GRASPx accurately estimates the abundances of the antimicrobial resistance (AMR) gene families and enables accurate characterization of the resistome profiles of these microbial communities. For real human oral microbiome MT datasets, using the HMM-GRASPx estimated transcript abundances significantly improves detection of differentially expressed (DE) genes.
Finally, HMM-GRASPx was used to reconstruct comprehensive sets of complete or near-complete protein and nucleotide sequences for the query protein families. HMM-GRASPx is freely available online from http://sourceforge.net/projects/hmm-graspx.
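As a point of reference for the Viterbi step mentioned above, the following is a minimal sketch of plain Viterbi decoding on an ordinary HMM. It is not the banded, profile-HMM variant used by HMM-GRASPx, and the two-state toy model (states `A` and `B`, symbols `x` and `y`, and all probabilities) is a hypothetical illustration.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path for obs, working in log space."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for symbol in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor of s given the scores at the previous time step
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][symbol]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}

# Hypothetical toy model: two sticky states with distinct preferred symbols
STATES = ["A", "B"]
START = to_log({"A": 0.5, "B": 0.5})
TRANS = to_log({"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}})
EMIT = to_log({"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}})

print(viterbi("xxyy", STATES, START, TRANS, EMIT))
```

A banded variant would restrict the inner loop to states near the diagonal of the dynamic-programming matrix; that restriction is omitted here for brevity.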

]]>
<![CDATA[SMCis: An Effective Algorithm for Discovery of Cis-Regulatory Modules]]> https://www.researchpad.co/article/5989dab0ab0ee8fa60bab1e5

The discovery of cis-regulatory modules (CRMs) is a challenging problem in computational biology. Limited by the difficulty of using an HMM to model dependent features in transcriptional regulatory sequences (TRSs), the probabilistic modeling methods based on HMMs cannot accurately represent the distance between regulatory elements in TRSs and are cumbersome to model the prevailing dependencies between motifs within CRMs. We propose a probabilistic modeling algorithm called SMCis, which builds a more powerful CRM discovery model based on a hidden semi-Markov model. Our model characterizes the regulatory structure of CRMs and effectively models dependencies between motifs at a higher level of abstraction based on segments rather than nucleotides. Experimental results on three benchmark datasets indicate that our method performs better than the compared algorithms.
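One way to see why an ordinary HMM struggles to represent distances between elements: a state with self-transition probability a implicitly has a geometric duration distribution, whereas a hidden semi-Markov model lets the duration distribution be specified freely. The sketch below illustrates that general property only; the function names and the example duration table are hypothetical and are not taken from SMCis.

```python
def hmm_duration_pmf(self_transition, d):
    """P(a state lasts exactly d steps) implied by an HMM self-transition probability."""
    if d < 1:
        return 0.0
    # Stay d-1 times, then leave once: geometric distribution
    return self_transition ** (d - 1) * (1.0 - self_transition)

def hsmm_duration_pmf(duration_table, d):
    """In a hidden semi-Markov model the duration pmf is an explicit parameter."""
    return duration_table.get(d, 0.0)

# Hypothetical segment that should last about 6 positions
explicit = {5: 0.2, 6: 0.6, 7: 0.2}

for d in (1, 6):
    print(d, hmm_duration_pmf(0.8, d), hsmm_duration_pmf(explicit, d))
```

Whatever value of a is chosen, the geometric form always makes a duration of 1 the most likely, so a preferred spacing such as 6 can only be expressed with an explicit duration model.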

]]>
<![CDATA[Semi-Supervised Active Learning for Sound Classification in Hybrid Learning Environments]]> https://www.researchpad.co/article/5989dab5ab0ee8fa60bac816

Coping with scarcity of labeled data is a common problem in sound classification tasks. Approaches for classifying sounds are commonly based on supervised learning algorithms, which require labeled data; such data is often scarce, which leads to models that do not generalize well. In this paper, we efficiently combine confidence-based Active Learning and Self-Training with the aim of minimizing the need for human annotation in sound classification model training. The proposed method pre-processes the instances that are ready for labeling by calculating their classifier confidence scores, then delivers the candidates with lower scores to human annotators, while those with high scores are automatically labeled by the machine. We demonstrate the feasibility and efficacy of this method in two practical scenarios: pool-based and stream-based processing. Extensive experimental results indicate that our approach requires significantly fewer labeled instances to reach the same performance in both scenarios compared to Passive Learning, Active Learning and Self-Training. A reduction of 52.2% in human-labeled instances is achieved in both the pool-based and stream-based scenarios on a sound classification task considering 16,930 sound instances.
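The routing rule described above (low-confidence instances go to human annotators, high-confidence instances are labeled automatically) can be sketched as follows. This is a minimal illustration, assuming a single fixed confidence threshold; the function name, the tuple layout, and the value 0.8 are hypothetical and are not taken from the paper.

```python
def route_by_confidence(scored, threshold=0.8):
    """Split scored instances into those needing human labels and those auto-labeled.

    scored: iterable of (instance_id, predicted_label, confidence) tuples.
    threshold: hypothetical cut-off; at or above it the machine label is trusted.
    """
    to_human, auto_labeled = [], []
    for instance_id, label, confidence in scored:
        if confidence >= threshold:
            # Self-Training branch: keep the classifier's own prediction
            auto_labeled.append((instance_id, label))
        else:
            # Active Learning branch: ask a human annotator
            to_human.append(instance_id)
    return to_human, auto_labeled

batch = [("clip1", "dog", 0.95), ("clip2", "rain", 0.55), ("clip3", "dog", 0.80)]
print(route_by_confidence(batch))
```

In a pool-based setting the function would be applied to the whole unlabeled pool per iteration; in a stream-based setting it would be applied to each incoming instance or small batch.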

]]>
<![CDATA[Likelihood-Based Inference of B Cell Clonal Families]]> https://www.researchpad.co/article/5989dad8ab0ee8fa60bb8c09

The human immune system depends on a highly diverse collection of antibody-making B cells. B cell receptor sequence diversity is generated by a random recombination process called “rearrangement” forming progenitor B cells, then a Darwinian process of lineage diversification and selection called “affinity maturation.” The resulting receptors can be sequenced in high throughput for research and diagnostics. Such a collection of sequences contains a mixture of various lineages, each of which may be quite numerous, or may consist of only a single member. As a step to understanding the process and result of this diversification, one may wish to reconstruct lineage membership, i.e. to cluster sampled sequences according to which came from the same rearrangement events. We call this clustering problem “clonal family inference.” In this paper we describe and validate a likelihood-based framework for clonal family inference based on a multi-hidden Markov Model (multi-HMM) framework for B cell receptor sequences. We describe an agglomerative algorithm to find a maximum likelihood clustering, two approximate algorithms with various trade-offs of speed versus accuracy, and a third, fast algorithm for finding specific lineages. We show that under simulation these algorithms greatly improve upon existing clonal family inference methods, and that they also give significantly different clusters than previous methods when applied to two real data sets.

]]>
<![CDATA[A Directed Acyclic Graph-Large Margin Distribution Machine Model for Music Symbol Classification]]> https://www.researchpad.co/article/5989da07ab0ee8fa60b76622

Optical Music Recognition (OMR) has received increasing attention in recent years. In this paper, we propose a classifier based on a new method named Directed Acyclic Graph-Large margin Distribution Machine (DAG-LDM). The DAG-LDM is an improvement of the Large margin Distribution Machine (LDM), which is a binary classifier that optimizes the margin distribution by maximizing the margin mean and minimizing the margin variance simultaneously. We modify the LDM to the DAG-LDM to solve the multi-class music symbol classification problem. Tests are conducted on more than 10000 music symbol images, obtained from handwritten and printed images of music scores. The proposed method provides superior classification capability and achieves much higher classification accuracy than the state-of-the-art algorithms such as Support Vector Machines (SVMs) and Neural Networks (NNs).
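The two quantities that the LDM trades off, the margin mean and the margin variance, can be computed for a linear binary classifier as in the sketch below. It assumes a bias-free linear model with labels in {+1, -1} and uses the population variance; the normalisation and the function name are assumptions for illustration, and the multi-class DAG construction is not shown.

```python
def margin_stats(w, samples):
    """Return (margin mean, margin variance) of a linear classifier.

    w: weight vector as a list of floats.
    samples: list of (x, y) pairs with x a feature list and y in {+1, -1}.
    The margin of a sample is y * (w . x); larger means more confidently correct.
    """
    margins = [y * sum(wi * xi for wi, xi in zip(w, x)) for x, y in samples]
    mean = sum(margins) / len(margins)
    variance = sum((m - mean) ** 2 for m in margins) / len(margins)
    return mean, variance

# Hypothetical two-feature example
data = [([2.0, 0.0], 1), ([-4.0, 5.0], -1)]
print(margin_stats([1.0, 0.0], data))
```

A margin-distribution method prefers weight vectors with a high mean and a low variance, rather than maximising only the smallest margin as a standard SVM does.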

]]>
<![CDATA[The Influence of Hydroxylation on Maintaining CpG Methylation Patterns: A Hidden Markov Model Approach]]> https://www.researchpad.co/article/5989db1dab0ee8fa60bce79f

DNA methylation and demethylation are opposing processes that when in balance create stable patterns of epigenetic memory. The control of DNA methylation pattern formation by replication dependent and independent demethylation processes has been suggested to be influenced by Tet mediated oxidation of 5mC. Several alternative mechanisms have been proposed suggesting that 5hmC influences either replication dependent maintenance of DNA methylation or replication independent processes of active demethylation. Using high resolution hairpin oxidative bisulfite sequencing data, we precisely determine the amount of 5mC and 5hmC and model the contribution of 5hmC to processes of demethylation in mouse ESCs. We develop an extended hidden Markov model capable of accurately describing the regional contribution of 5hmC to demethylation dynamics. Our analysis shows that 5hmC has a strong impact on replication dependent demethylation, mainly by impairing methylation maintenance.

]]>
<![CDATA[Computational Effective Fault Detection by Means of Signature Functions]]> https://www.researchpad.co/article/5989daefab0ee8fa60bc0b14

The paper presents a computationally effective method for fault detection. A system’s responses are measured under healthy and faulty conditions. These signals are used to calculate so-called signature functions that span a signal space. The current system’s response is projected into this space, and the location of the signal in this space allows the fault to be determined easily. No classifier such as a neural network or hidden Markov model is required. The advantage of the proposed method is its efficiency, as computing projections amounts to calculating dot products. The method is therefore suitable for real-time embedded systems: its simplicity and undemanding processing requirements permit the use of low-cost hardware and allow rapid implementation. The approach performs well for systems that can be considered linear and stationary. The communication presents an application in which an industrial moulding process is supervised. The machine is composed of forms (dies) whose alignment must be precisely set and maintained during operation. Typically, the process is stopped periodically to check the alignment manually. The applied algorithm allows on-line monitoring of the device by analysing the acceleration signal from a sensor mounted on a die. This enables failures to be detected at an early stage, thus prolonging the machine’s life.
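The idea that projection reduces to dot products can be sketched as below. This is an illustrative simplification, assuming that each condition is represented by one given signature vector, that signatures are roughly orthogonal, and that the condition with the largest normalised coordinate is reported; how the paper actually derives its signature functions is not reproduced here, and all names and values are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def project(response, signatures):
    """Coordinate of response along each signature, normalised by signature length."""
    return {name: dot(response, sig) / math.sqrt(dot(sig, sig))
            for name, sig in signatures.items()}

def classify(response, signatures):
    """Report the condition whose signature the response aligns with most."""
    coords = project(response, signatures)
    return max(coords, key=lambda name: coords[name])

# Hypothetical sampled signatures for two machine conditions
SIGNATURES = {
    "healthy": [1.0, 0.0, 0.0, 0.0],
    "misaligned": [0.0, 1.0, 1.0, 0.0],
}

print(classify([0.1, 0.9, 1.1, 0.0], SIGNATURES))
```

Each classification costs one dot product per signature, which is why such a scheme fits low-cost embedded hardware.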

]]>
<![CDATA[4C-ker: A Method to Reproducibly Identify Genome-Wide Interactions Captured by 4C-Seq Experiments]]> https://www.researchpad.co/article/5989db1aab0ee8fa60bcdeb9

4C-Seq has proven to be a powerful technique to identify genome-wide interactions with a single locus of interest (or “bait”) that can be important for gene regulation. However, analysis of 4C-Seq data is complicated by the many biases inherent to the technique. An important consideration when dealing with 4C-Seq data is the differences in resolution of signal across the genome that result from differences in 3D distance separation from the bait. This leads to the highest signal in the region immediately surrounding the bait and increasingly lower signals in far-cis and trans. Another important aspect of 4C-Seq experiments is the resolution, which is greatly influenced by the choice of restriction enzyme and the frequency at which it can cut the genome. Thus, it is important that a 4C-Seq analysis method is flexible enough to analyze data generated using different enzymes and to identify interactions across the entire genome. Current methods for 4C-Seq analysis only identify interactions in regions near the bait or in regions located in far-cis and trans, but no method comprehensively analyzes 4C signals of different length scales. In addition, some methods also fail in experiments where chromatin fragments are generated using frequent cutter restriction enzymes. Here, we describe 4C-ker, a Hidden-Markov Model based pipeline that identifies regions throughout the genome that interact with the 4C bait locus. In addition, we incorporate methods for the identification of differential interactions in multiple 4C-seq datasets collected from different genotypes or experimental conditions. Adaptive window sizes are used to correct for differences in signal coverage in near-bait regions, far-cis and trans chromosomes. Using several datasets, we demonstrate that 4C-ker outperforms all existing 4C-Seq pipelines in its ability to reproducibly identify interaction domains at all genomic ranges with different resolution enzymes.

]]>
<![CDATA[Tandem duplications lead to novel expression patterns through exon shuffling in Drosophila yakuba]]> https://www.researchpad.co/article/5989db5cab0ee8fa60be0157

One common hypothesis to explain the impacts of tandem duplications is that whole gene duplications commonly produce additive changes in gene expression due to copy number changes. Here, we use genome wide RNA-seq data from a population sample of Drosophila yakuba to test this ‘gene dosage’ hypothesis. We observe little evidence of expression changes in response to whole transcript duplication capturing 5′ and 3′ UTRs. Among whole gene duplications, we observe evidence that dosage sharing across copies is likely to be common. The lack of expression changes after whole gene duplication suggests that the majority of genes are subject to tight regulatory control and therefore not sensitive to changes in gene copy number. Rather, we observe changes in expression level due to both shuffling of regulatory elements and the creation of chimeric structures via tandem duplication. Additionally, we observe 30 de novo gene structures arising from tandem duplications, 23 of which form with expression in the testes. Thus, the value of tandem duplications is likely to be more intricate than simple changes in gene dosage. The common regulatory effects from chimeric gene formation after tandem duplication may explain their contribution to genome evolution.

]]>