Motivation: RNA H-type pseudoknots are ubiquitous pseudoknots that are found in almost all classes of RNA and thought to play very important roles in a variety of biological processes. Detection of these RNA H-type pseudoknots can improve our understanding of RNA structures and their associated functions. However, the currently existing programs for detecting such RNA H-type pseudoknots are still time consuming and sometimes even ineffective. Therefore, efficient and effective tools for detecting the RNA H-type pseudoknots are needed.
Results: In this paper, we have adopted a heuristic approach to develop a novel tool, called HPknotter, for efficiently and accurately detecting H-type pseudoknots in an RNA sequence. In addition, we have demonstrated the applicability and effectiveness of HPknotter by testing on some sequences with known H-type pseudoknots. Our approach can be easily extended and applied to other classes of more general pseudoknots.
RNA pseudoknots are found in almost all classes of naturally occurring RNAs and play very important roles in a variety of biological processes, such as RNA replication, transcription and translation (Kolk et al., 1998). The majority of pseudoknots that have been described to date are of the so-called H-type (or classical) pseudoknot in which nucleotides from a hairpin-loop pair with a single-stranded region outside of the hairpin to form a helical stem that is adjacent or almost adjacent to the the hairpin stem (Pleij and Bosch, 1989; Pleij, 1990; ten Dam et al., 1992; Pleij, 1994). For instance, there are 246 different pseudoknots in PseudoBase1 with 224 of them being H-type. Hence, the detection of H-type pseudoknots should improve our understanding of RNA structures and their associated functions. In principle, an H-type pseudoknot (called H-pseudoknot) may contain two stems (regions A and C in Fig. 1) and three loops (regions B, D and E in Fig. 1), where such stems and loops are usually represented in the 5′ → 3′ direction as S1 (Stem 1), S2 (Stem 2), and L1 (Loop 1), L2 (Loop 2) and L3 (Loop 3), respectively. However, L2 is absent in the most studied type of pseudoknots owing to the coaxial stacking of stems. Classical pseudoknots have simple loops in which all nucleotides are unpaired and complicated loops that contain substructures without pseudoknots, such as several stems with their own internal, hairpin and multibranch loops. Both simple and complicated loops are referred to as pseudoknot loops. For simplicity, all the nucleotides in a pseudoknot loop are counted and their number equals to the size of this loop, whether they are unpaired or not. The pseudoknot stems adopted here are those that are ‘pseudoknotted’ with other stems. They may be interrupted by some bulge loops (or interior loops). By convention, the unpaired nucleotides in these loops are, however, not counted for determining the size of a pseudoknot stem.
In the standard thermodynamic model, a pseudoknot-free RNA secondary structure of minimum free energy (MFE) can be computed using dynamic programming in
In addition to the above thermodynamic approaches, several other approaches for predicting RNA secondary structures with (H-type) pseudoknots have been proposed, such as maximum weighted matching (Cary and Stormo, 1995; Tabaska et al., 1998; Ieong et al., 2003), quasi-Monte Carlo searches (Abrahams et al., 1990; Gultyaev, 1991), genetic algorithms (van Batenburg et al., 1995; Gultyaev et al., 1995; Shapiro and Wu, 1997), stochastic context free grammar (Brown and Wilson, 1996; Cai et al., 2003), and others (Ieong et al., 2003; Tahi et al., 2003; Ruan et al., 2004). Particularly, Shapiro and Wu (1997) developed a parallel genetic algorithm for detecting H-pseudoknots on a massively parallel supercomputer MasPar MP-2 with 16 384 processors. Recently, this parallel genetic algorithm has been adapted to MIMD parallel machines (Shapiro et al., 2001), such as SGI ORIGIN 2000 with 64 processors and CRAY T3E with 512 processors, which seem to be hardly accessible to the ordinary users.
To simplify algorithmic computation, the H-pseudoknots are classified into four classes as shown in Table 1 based on the sizes of their stems and loops, where the case of size(S1) = size(S2) and size(L1) = size(L3) is allowed to belong to any of four classes. Basically, our designed HPknotter works with five phases as follows (see Fig. 2 for its flow diagram). In the first phase, it runs RNAMotif on the input RNA sequence with a user-specified descriptor for a class of H-pseudoknots, which produces a list of sequence fragments, called hits, that match the user-specified descriptor. RNAMotif (Macke et al., 2001) is an RNA structural motif search tool to find the fragments of a given RNA sequence that conform to a predefined descriptor of defining a particular structural motif. In Figure 3, the RNAMotif descriptor used in our HPknotter to describe the H-type pseudoknotted structures of class 2 is shown. To define the descriptor of each class of H-pseudoknots that fits as closely as possible to the naturally occurring pseudoknots, we further count the frequencies of the occurring stem sizes and loop sizes in PseudoBase (van Batenburg et al., 2000; van Batenburg et al., 2001). The stem- and loop-size distributions of S1, S2, L1, L2 and L3 are shown in Figure 4, where 4 (respectively, 1 and 3) pseudoknots with big loop-size (≥100 bp) are omitted in the case of L1 (respectively, L2 and L3). Then the size ranges that cover the most parts of the distributions are chosen to serve as the default size ranges of the stems and loops in HPknotter, where these default size ranges can be modified by the users to meet their requirements according to their biological knowledge about the tested data.
The hit sequences contained in the output of the first stage then serve as input to the next phase. Note that at this moment, each hit has the possibility of folding into the pseudoknotted structure of the H-type as defined in the descriptor of RNAMotif (herein, the H-pseudoknot of this kind is referred to as an RNAMotif H-pseudoknot for convenience). However, whether or not this RNAMotif-pseudoknotted structure is the native structure of the hit, i.e. the stable structure with minimum energy, is still unknown. The simplest verification way is to apply the currently existing prediction program (like PKNOTS/NUPACK/pknotsRG) to each hit sequence and examine whether it indeed folds into a stable H-pseudoknot conforming to the descriptor. However, such a verification for all hit sequences is impractical. The reason is that even for a short RNA sequence, a great number of hit sequences are usually produced by RNAMotif and hence the verification of each hit sequence using PKNOTS, NUPACK or pknotsRG costs much time, which leads the overall process of verification above to being extremely time consuming. Therefore, a more efficient verification is needed to improve the overall performance, especially in speed.
From the thermodynamic viewpoint, a pseudoknotted structure of a hit sequence with very low energy (or the lowest energy) is more likely to form in the native structure of the hit sequence. For a hit sequence, however, if the energy of the pseudoknotted structure with possible stems in their loops (defined by the descriptor) is much greater than that of its pseudoknot-free secondary structure with minimum energy, this hit sequence is unlikely to fold into a native pseudoknot that conforms to the descriptor. And as a result, this hit sequence can be discarded directly without any verification. Based on this observation, a hit filter is designed herein to filter out those hit sequences whose energies calculated based on their RNAMotif-pseudoknotted structures with possible stems in their loops are greater than the minimum energies of their pseudoknot-free secondary structures predicted by the pseudoknot-free secondary structure prediction programs. To make this comparison, the energies of the above pseudoknotted and pseudoknot-free structures are recalculated using the energy computation program provided by NUPACK such that the computed energies are based on the same energy rules and thermodynamic parameters. Note that when computing the energy of the pseudoknotted structure of each hit sequence, we also count the possible energy contributed by the interaction between the hit sequence and the flankingsequences.
Currently, the cost of calculating a secondary structure without pseudoknots is much less than that of predicting a secondary structure with pseudoknots. For example, PKNOTS and NUPACK both cost
In the third phase, the filtered hits are further double checked by the pseudoknotted prediction program PKNOTS/NUPACK/pknotsRG to check whether or not they indeed fold into the stable pseudoknots. A filtered hit is then called as an H-pseudoknot candidate if PKNOTS/NUPACK/pknotsRG is able to fold it into a stable pseudoknot.
It is worth mentioning that each H-pseudoknot candidate generated in the third phase may not be an H-pseudoknot, or may be an H-pseudoknot not capable of conforming to the user-specified descriptor. The reason for the former case is that PKNOTS and NUPACK can predict a more general class of pseudoknots. One reason for the latter case is that one of its H-pseudoknot stems may contain a long loop that violates the known biological knowledge. According to the H-pseudoknots maintained in PseudoBase, most of them contain no loop in their pseudoknot stems. Only a few H-pseudoknots contain one loop in their pseudoknot stems and most of them contain either an interior loop of size 2 or a bulge of size 1. Another possible reason is that the candidate is indeed a stable H-pseudoknot, but it belongs to a different class of H-pseudoknots. Based on these observations, in the fourth phase we further design an H-pseudoknot filter to filter out those H-pseudoknot candidates that are not the desired H-pseudoknots or contain a long loop in their stems. We call the remaining H-pseudoknot candidates passing through the H-pseudoknot filter as the filteredH-pseudoknots.
In fact, several filtered H-pseudoknots may overlap among their ranges in the sequence, which means that they cannot exist in the stable structure of a given RNA sequence simultaneously. Among the filtered H-pseudoknots, therefore, we further find the mutually disjoint H-pseudoknots whose total free energy is minimum in the fifth phase. Actually, this problem becomes a well-known combinatorial problem, called the maximum weight independent set problem on interval graphs, if the range of each filtered H-pseudoknot is considered as an interval in the sequence associated with the magnitude of its free energy as the weight. The maximum weight independent set problem on interval graphs can be solved in linear time (Hsiao et al., 1992). In HPknotter, we have implemented this algorithm to compute the mutually disjoint H-pseudoknots with minimum total free energy among the filtered H-pseudoknots and use them as the final output of HPknotter.
Based on the phases described in the previous section, we have implemented a novel tool, the HPknotter, by incorporating several existing programs, RNAMotif (Macke et al., 2001), PKNOTS (Rivas and Eddy, 1999), NUPACK (Dirks and Pierce, 2003) and pknotsRG (Reeder and Giegerich, 2004), for detecting the H-pseudoknots of a given RNA sequence. The HPknotter was written in Perl. Its web server, implemented in PHP, is available for online analysis at http://bioalgorithm.life.nctu.edu.tw/HPKNOTTER/. We incorporated the well-developed programs PKNOTS, NUPACK and pknotsRG into our HPknotter pipeline, and compared this combination with these three programs used as stand-alone tools. The experiments were carried out on a number of RNA sequences with known H-pseudoknots. Unless otherwise specified, all programs were run with default parameters on IBM PC with 3.06 GHz processor and 2 GB RAM under Linux system.
The tested sequences were taken from the 5S rRNA of Escherichia coli (5S-rRNA) (Cannone et al., 2002), the RNA sequence inhibiting human immunodeficiency virus type 1 (HIV-1-RT) reverse transcriptase (Tuerk et al., 1992), the 3′-UTR of tobacco mosaic virus (TMV-3′) (van Belkum et al., 1985), the turnip yellow mosaic virus (TYMV-3′) sequence (Rietveld et al., 1982), the 5′-UTR of human parechovirus (HPeV1-5′) (Nateri et al., 2002), the bacteriophage T2 and T4 gene 32 mRNA sequences (T2 and T4) (McPheeters et al., 1988), and the 3′-UTRs of several coronaviruses (BCV-3′, MHV-3′ and SARS-TW1-3′) including severe acute respiratory syndrome virus (SARS) (Williams et al., 1999; Tsai et al., 2004) (see Table 2 for the information of the tested sequences and their H-pseudoknot numbers). All sequences above, except 5S-rRNA, are known to contain at least one H-pseudoknot as reported in the literature.
A summary of the overall sensitivity and specificity for all experiments, which were run using the general class of the descriptor without an interior or bulge loop in the pseudoknot stems, is shown in Tables 3, in which we let Sbp = (100 × TP)/(TP + FN), Pbp = (100 × TP)/(TP + FP) and Π = (number of correctly predicted H-pseudoknots)/(number of predictedH-pseudoknots) (i.e. the fraction of the correctly predicted H-pseudoknots), where TP = true positive (i.e., the number of the correctly predicted base-pairs in the predicted H-pseudoknots), FN = false negative (i.e. the number of the base-pairs in the published H-pseudoknots that were not predicted), FP = false positive (i.e. the number of the incorrectly predicted base-pairs in the predicted H-pseudoknots). The correctly predicted H-pseudoknots denote those predicted H-pseudoknots reported in the literature.
In this set of experiments, PKNOTS and NUPACK were not able to deal with the cases of T2, T4, BCV-3′, MHV-3′ and SARS-TW1-3′, owing to the running out of memory. For the other sequences, PKNOTS and NUPACK exhibited almost the same prediction results in which the H-pseudoknot of HIV-1-RT was identified, but the H-pseudoknots of TMV-3′-up, TYMV-3′ and HPeV1-5′ were missed2. (Note that PKNOTS could predict two real H-pseudoknots of TMV-3′-down, if the version of PKNOTS was 1.04, instead of 1.01.) Notably, most of the above results were improved when we conducted all the experiments using pknotsRG. However, the H-pseudoknots of T4, SARS-TW1-3′ and TMV-3′-down were still missed by pknotsRG. The inability of detecting the real H-pseudoknots described above evidences the fact that for the long RNA sequence, the MFE model might miss the H-pseudoknots that are actually present in the native structure. In our experiments (Table 3), however, this situation was significantly improved by our HPknotter, because most of the real H-pseudoknots of TMV-3′-up, T4, TYMV-3′, SARS-TW1-3′ and TMV-3′-down were detected with high sensitivity and specificity.
The key point lies in the fact that our HPknotter first uses RNAMotif to search for all fragments of the given RNA sequence that have the possibility of folding into an H-pseudoknot and then applies PKNOTS/NUPACK/pknotsRG to these fragments for determining if their MFE structures are indeed H-pseudoknots. In this situation, without effect on the nucleotides outside the fragments, PKNOTS/NUPACK/pknotsRG seems to give a higher probability of successfully recognizing the pseudoknotted structures of fragments. This approach, of course, inevitably increases the number of incorrectly predicted H-pseudoknots, because it ignores the global effect of all input nucleotides by considering just the local fragments of the input RNA sequence. In fact, our experiments showed that the number of the incorrectly predicted H-pseudoknots was reasonable because among all these predicted H-pseudoknots, HPknotter applies the concept of maximum weight independent set at the last stage to compute the mutually disjoint H-pseudoknots with minimum total free energy.
Generally speaking, as shown in Table 3, our HPknotter greatly improves sensitivity, specificity and the fraction Π of correctly predicted H-pseudoknots when compared with original PKNOTS, NUPACK and pknotsRG. It should be noted that the number of incorrectly predicted H-pseudoknots in the cases with PKNOTS-kernel are not greater than those in the cases with NUPACK-kernel and pknotsRG-kernel, which seems to imply that PKNOTS itself is more accurate than NUPACK and pknotsRG, even though PKNOTS is more time consuming than NUPACK and pknotsRG from the computational point of view.
It is worth mentioning that as shown in Table 4, the overall prediction accuracy will be further improved if we rerun all tested RNA sequences above, except 5S-rRNA containing no H-pseudoknot, by choosing the specific class to which the predicted H-pseudoknots belong, instead of using the general class of descriptor. Particularly, the Π values (Table 4) and the performance of running time (Table 5) were greatly improved. These experiments indicate that our HPknotter can be served as an effective tool for validating if the tested RNA sequences have the same kind of H-pseudoknots as other closely related RNA sequences whose H-pseudoknots are already known in advance. For instance, SARS, BCV and MHV are all coronaviruses, and the H-pseudoknots of BCV-3′ and MHV-3′, both of which belong to class 2 of H-pseudoknots, are already known and have been proven by previous experiments (Williams et al., 1999). It is reasonable to expect that SARS-TW1-3′ may contain an H-pseudoknot of class 2. Therefore, we can apply our HPknotter to SARS-TW1-3′ by specifying the descriptor to be class 2 so that we are able to quickly obtain the same result as the general descriptor.
In fact, our HPknotter is not CPU intensive at all because based on our experiments, a great number of the hit sequences produced by RNAMotif were filtered out by the hit filter. Take the experiments with SARS-TW1-3′ in Table 3 for an example. In the first phase, RNAMotif in total found 2132 hits that conform to the descriptor of general class. If we directly apply PKNOTS to all these unfiltered hits to check if they fold into a stable H-pseudoknot, then the program will require about 51 h to finish the job. However, after running the hit filter, only 43 different hit sequences remained, which then cost the following PKNOTS only about 5.2 min to determine if they are stable pseudoknots. As a result, the third phase of running pseudoknot prediction with PKNOTS left us with only 11 pseudoknot candidates that could fold into stable pseudoknots. Next, only seven candidates remained after running the H-pseudoknot filter in the fourth phase. In fact, some of these filtered H-pseudoknots may have an overlap among their ranges in the sequence, which suggests that they cannot exist simultaneously in a stable pseudoknotted structure in SARS-TW1-3′. Finally, only two H-pseudoknots with minimum free energy were selected in the phase of computing the maximum weight independent set. Table 1 lists the CPU usage time for PKNOTS, NUPACK, pknotsRG and our HPknotter, where all tests were run on IBM PC with 3.06 GHz processor and 2 GB RAM under Linux system.
In this paper, we designed a heuristic approach for efficiently and accurately detecting RNA H-pseudoknots, the ubiquitous pseudoknots in the naturally occurring RNAs. The currently existing thermodynamic-based programs, like PKNOTS, NUPACK and pknotsRG, are useful for finding stable H-pseudoknots. However, most of them are highly time consuming and memory consuming, which limits them to predict short sequences of a couple of hundred bases length. Another main weakness of these programs is that they may not be effective to detect the actually existing H-pseudoknots that are contained in a long RNA sequence, as evidenced by our experiments. Based on our heuristic approach mentioned in this paper, we have implemented a novel program, the HPknotter, capable of efficiently and accurately detecting the H-pseudoknots of a given RNA sequence by incorporating four existing programs RNAMotif, PKNOTS, NUPACK and pknotsRG. In summary, we have demonstrated the practicability and effectiveness of our developed HPknotter by testing it on several RNA sequences, most of which have been proven to contain the H-pseudoknotted structures. Through several experiments, our HPknotter has been shown to be practical for the detection of H-pseudoknots in RNA sequences because it is not computationally expensive and has much better sensitivity and specificity than PKNOTS, NUPACK and pknotsRG. In addition, it is feasible to extend and apply our heuristic approach to detecting the other classes of more generalpseudoknots.
Conflict of Interest: none declared.
|Length (bp)||PKNOTS||NUPACK||pknotsRG||HPknotter (General class)||HPknotter (Specific class)|
|84||7.3 min||13.1 s||0.05 s||31 s||27 s||26 s||9 s||7 s||6 s|
|105||35 min||44.7 s||0.1 s||2.2 min||35 s||29 s||38 s||10 s||8 s|
|200||72 h||—||0.8 s||5.2 min||1.8 min||1.5 min||1.6 min||33 s||30 s|
|341||—||—||7.4 s||7.1 min||2.4 min||2.3 min||2.2 min||46 s||45 s|
|946||—||—||10.1 min||13.8 min||7.5 min||6.9 min||4.1 min||2.2 min||2.1 min|
|1340||—||—||43.5 min||35.3 min||11.6 min||10.9 min||11.6 min||3.1 min||2.5 min|