Motivation: Although the outbreak of the severe acute respiratory syndrome (SARS) is currently over, it is expected that it will return to attack human beings. A critical challenge to scientists from various disciplines worldwide is to study the specificity of cleavage activity of SARS-related coronavirus (SARS-CoV) and use the knowledge obtained from the study for effective inhibitor design to fight the disease. The most commonly used inductive programming methods for knowledge discovery from data assume that the elements of input patterns are orthogonal to each other. Suppose a sub-sequence is denoted as P2P1P1′P2′, the conventional inductive programming method may result in a rule like ‘if P1 = Q, then the sub-sequence is cleaved, otherwise non-cleaved’. If the site P1 is not orthogonal to the others (for instance, P2, P1′ and P2′), the prediction power of these kind of rules may be limited. Therefore this study is aimed at developing a novel method for constructing non-orthogonal decision trees for mining protease data.
Result: Eighteen sequences of coronavirus polyprotein were downloaded from NCBI (http://www.ncbi.nlm.nih.gov). Among these sequences, 252 cleavage sites were experimentally determined. These sequences were scanned using a sliding window with size k to generate about 50 000 k-mer sub-sequences (for short, k-mers). The value of k varies from 4 to 12 with a gap of two. The bio-basis function proposed by Thomson et al. is used to transform the k-mers to a high-dimensional numerical space on which an inductive programming method is applied for the purpose of deriving a decision tree for decision-making. The process of this transform is referred to as a bio-mapping. The constructed decision trees select about 10 out of 50 000 k-mers. This small set of selected k-mers is regarded as a set of decisive templates. By doing so, non-orthogonal decision trees are constructed using the selected templates and the prediction accuracy is significantly improved.
Severe acute respiratory syndrome (SARS) has hit the world since late 2002 and caused more than 8000 infected patients and more than 800 deaths in 25 countries around the world (Yang et al., 2003). SARS has dramatically demonstrated the wide-ranging impact on a highly mobile world. In response to the SARS outbreak, delegates of the 56th World Health Assembly, organized by WHO in May 2003, unanimously adopted a resolution authorizing WHO to act on information arising from sources other than official government notifications. In addition, WHO was asked to conduct on-the-spot investigations to ensure that an affected country has sufficient control to prevent international spread.
A novel coronavirus has been discovered to be associated with the cases of SARS; hence it was named as SARS-CoV (Rota et al., 2002). SARS-CoV as a novel coronavirus (Ksiazek et al., 2003; Marra et al., 2003; Yount et al., 2003) is an infectious respiratory disease. It starts with a fever, chills, headache and body aches, followed by a dry cough within 2–7 days. The most distinguishing feature is breathing difficulty. In severe cases, radiography can provide corroborative evidence of SARS by diagnosing pneumonia. The virus has a spherical enveloped virion. The size is between 80 and 160 nm diameters with a single stranded RNA of about 30 kb, which is the largest genome of all single stranded RNA viruses (Marra et al., 2003; Rota et al., 2002). In the electron micrograph image, glycoproteins on the virus' surface give the virions a halo or crown-like appearance, and hence the name coronavirus.
Coronaviruses are positive-strand RNA viruses with exceptionally large genome sizes (Hegyi and Ziebuhr, 2002). The replication and transcription of coronaviruses are encoded by the replicase genes (Thiel et al., 2001a,b). In many studies, 15-mers (P8P7P6P5P4P3P2P1P1′P2′P3′P4′P5′P6′P7′) were used to represent corresponding 3CLpro cleavage sites in the replicase polyproteins (Merrifield, 1965). It was indicated in Hegyi and Ziebuhr (2002) and Pallai et al. (1989) that the conserved prototypic viruses suggest that the order of cleavage events may occur in all coronaviruses. The structure and dynamics of SARS-CoV protease have been analysed using a molecular dynamics simulation technique (Lee et al., 2003), where molecular docking has been carried out in order to search for potential SARS-CoV protease inhibitors. Previously characterized coronaviruses encode two papain-like cysteine proteases (PL1pro and PL2pro), which cleave the N-proximal polyprotein regions at three sites (Bonilla et al., 1997; Gorbalenya et al., 1991; Herold et al., 1998; Thiel et al., 2001b; Ziebuhr et al., 2001). The recent studies show that 3C-like cysteine protease (3CLpro) cleaves the central and C-proximal regions at 11 conserved sites (Hardy et al., 2002; Hegyi and Ziebuhr, 2002; Tibbles et al., 1999; Ziebuhr et al., 2000).
It is found that the SARS-CoV is unrelated to any well-characterized human coronaviruses although the genome organization is similar to them (Ksiazek et al., 2003; Marra et al., 2003; Rota et al., 2002). Besides, all the SARS-CoV genomes sequenced to date demonstrate surprisingly little variation with mutations at only ∼30 nucleotides. Based on this, Scientists suggest that the virus entered the human population recently from a single point source through mutation. The study has also shown that the SARS-CoV is similar to a civet CoV. It is then hypothesized that the virus could have jumped the species barrier from civets to humans and the SARS-CoV was mutated from non-human coronavirus (Hu et al., 2003).
SARS-CoV gene expression is expected to involve complex transcriptional, translational and post-translational regulatory mechanisms whose molecular details are still unknown (Thiel et al., 2001b). It has been indicated that although vaccines are available for some animal coronaviruses, some of them can promote the disease when vaccinated animals are exposed to the wild-type virus. Moreover, antibody enhancement of disease is a potential risk of SARS vaccines in humans (Holmes, 2003; Lee et al., 2003). This means that it will take many years to develop a good preventive vaccine against SARS-CoV. It was suggested by Lee et al. (2003) to optimize the use of available drugs as inhibitors through studying the enzyme conformation. A recent study used a frequency estimation method to detect the cleavage sites within a specified region based on the knowledge of average length of cleavage products. (Gao et al., 2003). In that study, 12-mers P6P5P4P3P2P1P1′P2′P3′P4′P5′P6′ and 8-mers P4P3P2P1P1′P2′P3′P4′ were used for PLpro and 3CLpro cleavage sites, respectively.
According to our previous experience in analysing protein sequences, the prediction capability of the functional sites (cleavage sites in this study) depends on the inherent patterns in sub-sequences. If the patterns are not complicated (i.e., the sites in the sub-sequences are orthogonal to each other), simple rules can be explored by using inductive programming methods; the prediction of phosphorylation sites in proteins is such a case (Berry et al., 2004). However, most cases are not simple. The success of prediction of the functional sites needs more advanced investigation (Thomson et al., 2003; Yang and Chou, 2004a,b). A novel method is therefore proposed in this study for mining sub-sequence data. Instead of using the sites in k-mers as the inputs to an inductive programming model, it is proposed to transform k-mers to a high-dimensional numerical space through a mapping. An inductive programming method is then used in this high-dimensional space. The transform of k-mers needs a proper function to ensure the biological content in the k-mers can be maintained for data mining. In this study, the bio-basis function proposed in Thomson et al. (2003) is used for this mapping, referred to as a bio-mapping. With the bio-mapping, each k-mer is represented using a vector denoting its position in this high-dimensional space. After the mapping, an inductive programming is employed regarding the mapping vectors as the inputs. The modelling process itself will select the most informative and decisive k-mers to construct a non-orthogonal decision tree. These selected k-mers are then regarded as the templates for decision-making. Importantly, these templates are the representatives for the training k-mers, and hence the knowledge hidden in the training k-mers. Instead of resulting in rules like ‘if P1 = Q, then the k-mer cleaved’, non-orthogonal decision trees will have rules like ‘if a query k-mer is similar to a cleaved template, then the k-mer is cleaved’.
In this study, 18 coronavirus polyproteins were downloaded from GeneBank (NCBI, http://www.ncbi.nlm.nih.gov) for the investigation. After applying the proposed method to this data, it is found that the sensitivity is greatly increased by 30% while the specificity is maintained. Besides, the proposed method even outperformed the early work in the same area using neural networks (Kiemer et al., 2004) by 7% in the sensitivity with a slightly improved specificity.
Eighteen coronavirus polyproteins whose sequences are available were downloaded from the GeneBank (NCBI, http://www.ncbi.nlm.nih.gov). They are NC_004718 (TOR2), NC_002645 (HCoV 229E), NC_001846 (MHV), NC_003045 (BCoV), NC_001451 (IBV), NC_002306 (TGEV), NC_003436 (PEDV), U_00735 (BCoVM), AF391542 (BCoVL), AF220295 (BCoVQ), AF208067 (MHVM), AF201929 (MHV2), AF208066 (MHVP), AY278741 (Urbani), AY278488 (BJ01), AY278554 (CUHK-W1), AY282752 (CUHK-su10) and AY291451 (TW1). Each has 14 cleavage sites. In total, there are 252 cleavage sites.
Decision trees are a kind of inductive programming algorithms (Breiman et al., 1984; Quinlan, 1988). They select a hyper-plane orthogonal to an axis of a variable through maximizing its ‘purity’. If a hyper-plane can make 100% separation between patterns from two classes with respect to a certain threshold value, its purity is 1. Otherwise, the purity value will be <1. Each hyper-plane divides a given region into two disjoint sub-regions. If each sub-region only comprises one class of patterns, the hyper-plane has a purity value as 1. The process of selecting hyper-planes continues till each resulting sub-region is pure for one class. The node which is pure for one class is referred to as a leaf. In terms of this, each leaf has an associated class label. From this, an inductive model is constructed and the training stage is completed. Various tree-pruning methods can be used to prevent over-fitting. In the testing stage, the hyper-planes will progressively lead a novel pattern into a leaf. Up to this end the classification of this novel pattern is completed through assigning the class label associated with the leaf to the novel pattern. Figure 1 shows a decision tree model, where there are two variables and the whole region is divided into six sub-regions. The first hyper-plane makes a separation using the threshold value a for the variable x. Both resulting sub-regions on the left and right of the hyper-plane are not pure. The second hyper-plane divides the left sub-region generated by the first hyper-plane into two smaller sub-regions using the threshold value b for the variable y. The upper sub-region has been pure for one class while the lower sub-region is still not pure and needs more separation.
Decision tree algorithms have been used for bioinformatics research covering many areas, for instance the prediction of Hepatitis C virus protease cleavage sites (Narayanan et al., 2002) and the prediction of phosphorylation sites (Berry et al., 2004). In Kretschmann et al. (2001) C4.5 was used for protein annotation through aligning novel proteins with SWISS-PROT databank. C4.5 was also used for the prediction of phenotypes associated with Saccharomyces cerevisiae genes on the basis of gene ontology functional annotations from the relevant databanks (King et al., 2003). Decision trees were compared with support vector machines for the prediction of the impact of the single nucleotide polymorphisms on protein function (Krishnan and Westhead, 2003). The study shows that the decision trees have the advantage of generating interpreting rules although they have a lower prediction accuracy than support vector machines. In Li et al. (2003) decision trees were used for identifying genes related with cancer, and hence they provided knowledge for cancer diagnosis. King and his colleague studied the use of C4.5 for discovering rules for the prediction of the ORFs whose function is unknown (Clare and King, 2003). The alternative decision tree (Freund and Mason, 1999) was used for predicting the genetic regulatory response (Middendorf et al., 2004). In Selbig et al. (1999) decision trees were used for the prediction of secondary structures. Decision trees were also used in searching short and statistically significant emerging patterns for cancer diagnosis using gene expression profiles (Boulesteix et al., 2003).
The bio-mapping proposed in this study is based on the use of the bio-basis function (Thomson et al., 2003). The basic principle of the bio-basis function is the normalization of pairwise homology alignment scores. Figure 2 shows a query 4-mer (LQSE) aligned with two templates (LQSK and YKAE) to produce two homology alignment scores a (56+48+40+32=176) and b (28+36+36+48=148), respectively. It will be illustrated later in this paper that LQSK and LQSE are cleaved 4-mers. The values 56, 48, 40, 32, 28 and 36 are obtained from the Dayhoff matrix (Johnson and Overington, 1993). For instance, the similarity between the amino acids L and L is 56 whilst the similarity between the amino acids L and Y is 28. Because a > b, it is believed that the query 4-mer shares more functional similarity with the first (cleaved) template.
The method of bio-mapping has been successfully used for the prediction of Trypsin cleavage sites (Thomson et al., 2003), HIV cleavage sites (Yang and Chou, 2004b; Yang and Thomson, 2005), hepatitis C virus protease cleavage sites (Yang and Berry, 2004), disordered protein prediction (Thomson and Esnouf, 2004; Yang et al., 2005), phosphorylation site prediction (Berry et al., 2004), the prediction of the O-linkage sites in glycoproteins (Yang and Chou, 2004a) and the prediction of caspase cleavage sites (Yang, 2005). A thorough review can be seen in Yang (2004).
In this study, the bio-basis function is employed for bio-mapping; i.e. for transforming the given k-mers to a high-dimensional numerical space on which an inductive programming method is employed for constructing a non-orthogonal decision tree. In a constructed decision tree, it is expected that the number of nodes will be much less than the number of given k-mers. As each node employs one k-mer, the selection of the most informative and decisive template is automatically completed in running an inductive programming method. In other words, the less informative k-mers are automatically removed in learning. It is also expected that the prediction accuracy of the non-orthogonal decision trees will not be lower than that of orthogonal trees constructed using the sites in k-mers as inputs.
Step 1. Sequences of 18 polyproteins are scanned using a sliding window with the size of k (k is always an even number). Each scan results in one k-mer denoted as Pk/2 ⋯ P2P1P1′P2′ ⋯ Pk/2′. A k-mer is classified as a cleaved or positive one if there is a cleavage site between P1 and P1′, and otherwise as non-cleaved or negative. The previous reports have used different window sizes (Gao et al., 2003; Hegyi and Ziebuhr, 2002; Herold et al., 1998; Merrifield, 1965; Pallai et al., 1989; Thiel et al., 2001a,b; Thiel et al., 2003; Ziebuhr et al., 2000; Ziebuhr et al., 2001). The window size is varied in this study from 4 to 12 (the largest in recent literature) to investigate the impact of window size on the prediction performance. After this process, about 50 000 k-mers are generated.
Step 2. All the k-mers are divided into 10 folds for 10-fold cross-validation. In each run, nine folds of k-mers are used for constructing orthogonal and non-orthogonal decision trees. The constructed tree is tested on the remaining fold of k-mers.
Step 3. Map nine folds of k-mers into a high-dimensional numerical space using the bio-mapping for constructing non-orthogonal decision trees. There are two bio-mapping strategies:
Step 4. Construct decision trees in the mapped numerical space. From this, draw and analyse the constructed decision trees. The free software package C4.5 is used in this study for tree construction.
Step 5. The prediction performance is assessed using five indicators, the true positive fraction (TPf), true negative fraction (TNf), total accuracy (Total), Matthews' correlation coefficient (MCC) (Matthews, 1975) and the positive prediction power (PPf). Suppose true negatives, true positives, false negatives and false positives are referred to as TN, TP, FN and FP, respectively, the definitions of these indicators are as follows:
Figure 3 shows the performance of three sets of models using window size 4, where ‘NM’ means the models without any bio-mapping and hence orthogonal decision trees, ‘BM1’ the models using the first bio-mapping strategy and ‘BM2’ the second bio-mapping strategy. It can be seen that ‘BM2’ worked the best. The NM models did not work well since they are unable to predict cleaved k-mers well. When the window size is increased, the performance was similar in general (data not shown).
Figure 4 shows a comparison among three sets of models using TPf as the horizontal axis and PPf as the vertical axis. Each set has five points representing five models each of which uses a distinct window size. The closer to the top-right corner it is, the better the model's performance. It can be seen that the set of BM2 models performed the best. The p-value of the t-test between the NM and BM1 models is 0.01 meaning that the null hypothesis that the BM1 model does not improve the prediction accuracy compared with the NM model is denied statistically. The p-value of the t-test between the BM1 and BM2 models is 0.02 also meaning that the null hypothesis that the BM2 model does not improve the prediction compared with BM1 model is denied statistically.
Because the performance of the BM1 model decreases when the window size increases, we use the window size 4 for the comparison of the constructed trees. Shown in Figure 5 is the constructed BM1 tree (or model) for the window size 4, where ‘f(s, LQSK, 1)’ means the output of a bio-basis function using a cleaved (marked by ‘1’, otherwise ‘0’) 4-mer LQSK, ‘< = 0.543351’ means the condition for transforming to one of two sub-trees, Y means that the condition is satisfied, N means that the condition is not satisfied and the triangles represent the class of non-cleaved 4-mers and the circles the class of cleaved 4-mers. Ten of about 50 000 4-mers were selected demonstrating their importance in decision-making. Five are cleaved and five are non-cleaved ones. The root node uses the cleaved 4-mer LQSK as the template. A query 4-mer without confirmed cleavage information is fed to the root node. If the similarity between the query 4-mer and LQSK is <0.543351, the decision is passed by to the sub-tree using GGAP as the root for further comparison. Suppose the query 4-mer has been with the leaf node associated with the non-cleaved 4-mer DYLA, we can investigate if the query 4-mer is cleaved or not. If the similarity between the query 4-mer and DYLA is <0.453845, the query 4-mer is labelled as a cleaved one as it is dissimilar to a non-cleaved 4-mer and otherwise as a non-cleaved one. Suppose the query 4-mer has been with the leaf node associated with the cleaved 4-mer LQAL, we can also investigate if the query 4-mer is cleaved or not. If the similarity between the query 4-mer and LQAL is <0.559898, the query 4-mer is labelled as non-cleaved as it is dissimilar to a cleaved 4-mer; otherwise it is labelled a cleaved one.
As a comparison, Figure 6 shows the constructed BM2 tree, where 11 cleaved 4-mers are selected as the templates for decision-making. Interestingly, the root node again selects the same cleaved 4-mer LQSK as the BM1 tree meaning that this 4-mer is the most important one for decision-making.
We then investigate the individual 4-mers which are selected as templates in either the BM1 or the BM2 model. Figure 7 shows the probability density functions of two 4-mers selected as templates in the decision trees, where the 4-mer LQSK was always selected as the template for the root node in both trees while DYLA was selected for one of the leaf nodes. Note that the thin lines represent the probability density function of the similarity between the non-cleaved 4-mers and a template, and the thick lines the probability density function of the similarity between the cleaved 4-mers and a template. It can be seen that the template LQSK shows high discriminating capability. This is why it is always selected as the template for the root node.
In Kiemer et al. (2004) the sensitivity was only 87% and the specificity was 99%. The sensitivity and specificity values of the BM1 and BM2 models are listed in Table 1. From this table, we can also see that the performance of BM1 decreases when the window size increases while the performance of BM2 maintains unchanged when changing the window size. This may result from the inclusion of long non-cleaved k-mers which disrupted the modelling performance in BM1 models.
This paper has presented a novel method called ‘bio-mapping’ based on non-orthogonal decision trees for mining protease data. Bio-mapping maps k-mers to a numerical space. In this numerical space, decision trees are used for mining protease data. The method has been applied to the prediction of the SARS-CoV protease cleavage sites. It has been found in the simulation that this novel method showed great success in two aspects: success in improving the prediction accuracy and success in extracting the most important k-mers as templates for decision-making. Two strategies were introduced for bio-mapping. The first strategy referred to as ‘BM1’ selects templates from all the k-mers. The second strategy referred to as ‘BM2’ selects templates only from the cleaved k-mers based on the observation made in our earlier study that most non-cleaved k-mers do not show conserved patterns. It has been found through computer simulation that the second strategy performed better.
It should be noted that there are many substitution matrices available, such as Blosum62 (Henikoff and Henikoff, 1992) and the most recently developed one called the composition-adjusted matrix (Yu et al., 2003). The composition-adjusted (for short, adjusted) has been used in this study. It can be seen from Figure 8 that both models performed equally well for the window sizes 4, 6, 8 and 10. However, the adjusted model with the window size 12 performed badly.
It has been mentioned that the 4-mer LQSK seems the most important template selected through this inductive programming learning in the high-dimensional space. Although different papers discussed the use of different sizes of k-mers for the prediction, it has to be noted that the studies were limited to a few polyproteins. The most interesting work described in Hegyi and Ziebuhr (2002) has found two templates (3-mers P2P1P1′) through experiments, i.e. LQS and LQA. This has also been confirmed by other researchers (Ziebuhr and Siddell, 1999; Gao et al., 2003; Thiel et al., 2003; Kiemer et al., 2004). This means that the proposed method in this study is able to extract the important templates for decision-making in an automatic and intelligent way and shows that computer programs are able to assist biological experiments for scientific findings.
The last issue is about window size. It has been found that the increased window size does not improve the sensitivity (Table 1) when using decision trees. However, the biological experiments have found that more than four sites are useful for inhibitor design to fight the disease (Anand et al., 2003; Gao et al., 2003; Thiel et al., 2003; Kiemer et al., 2004). This means that in the real use of a computer program for computer-aided drug design, biological knowledge should be used to alter the information provided by the decision-making system made by a computer program.
The programs were encoded in java and C on a PC containing a 500 MHz Pentium and Linux operating system.
The author thanks Prof. Quinlan for the use of C4.5 and the reviewers for their valuable comments. The author also thanks Singapore Nanyang Technology University for hosting the author for his sabbatical leave in November 2004, during which this work was completed.