Anticancer peptides (ACPs) are an emerging class of therapeutic agents that have attracted attention because of their lower risk of toxic side effects. However, identifying ACPs using experimental methods is both time-consuming and laborious. In this study, we developed a new and efficient algorithm that predicts ACPs by fusing multi-view features based on a dual-channel deep neural network ensemble model. In the model, one channel uses a convolutional neural network (CNN) to automatically extract the potential spatial features of a sequence; the other channel processes handcrafted features and extracts more effective representations from them. Additionally, an effective feature fusion method was explored for the mutual fusion of different features. Finally, we adopted a neural network to predict ACPs based on the fused features. Performance comparisons across single and fused features showed that fusing multi-view features effectively improved the model's predictive ability. Among these, fusing the CNN-extracted features with the composition of k-spaced amino acid group pairs achieved the best performance. To further validate the performance of our model, we compared it with other existing methods using two independent test sets. The results showed that our model's area under the curve (AUC) was 0.90, which was higher than that of the other existing methods on the first test set and higher than most of the other existing methods on the second test set. The source code and datasets are available at
Cancer is one of the leading causes of death worldwide (
Experimental methods that are currently used for the accurate identification of ACPs are difficult to use in high throughput screening because they are time-consuming and costly. Therefore, it is necessary to develop computational methods that can quickly and accurately identify ACPs. In the past decade, several proposed methods have used traditional machine learning to better identify ACPs.
In this study, we developed a new prediction model: Deep Learning and Feature Fusion-based ACP prediction (DLFF-ACP). First, a CNN channel automatically extracts spatial features from the peptide sequences. The most widely used handcrafted features are then fed into the handcrafted-feature channel, which processes them and extracts more effective representations. Second, the features extracted by the CNN channel from the peptide sequences (hereafter, CNN features) are fused with the output of the handcrafted-feature channel and input to a classifier to predict the peptide class. CNN was more effective at considering spatial information (
In this study, we selected datasets from ACPred-Fuse (
In order to input the peptide sequences into the deep learning model, we needed to transform them into numerical vectors. We assigned a different number, from 1 to 20, to each of the 20 amino acids. Since the length of the peptide sequences input into the model must be fixed, we extended each peptide sequence to length 210 by zero-padding, which accommodates our dataset's longest ACP (207 amino acids) and longest non-ACP (96 amino acids). By tuning its weights, the model quickly learned to ignore these padded zeros. The encoding process can be seen in
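The integer encoding and zero-padding described above can be sketched in a few lines of Python (the 210-length cutoff matches the text; the variable and function names are our own, not the authors'):

```python
# Hypothetical sketch of the integer encoding and zero-padding step.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 1..20; 0 is reserved for padding

MAX_LEN = 210  # longest ACP in the dataset is 207 residues

def encode_peptide(seq, max_len=MAX_LEN):
    """Map each residue to an integer in 1-20 and right-pad with zeros to max_len."""
    codes = [AA_TO_INT[aa] for aa in seq.upper()]
    return codes + [0] * (max_len - len(codes))

vec = encode_peptide("ACDG")
# vec[:4] == [1, 2, 3, 6]; the remaining 206 entries are padding zeros
```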
Different features can represent different information from the amino acid sequence. Here, we used three features that are commonly-used in ACP prediction: amino acid composition (AAC) (
AAC represents the frequency of each amino acid in the sequence. It is calculated using the following equation: $AAC(i) = n_i / L$, $i = 1, 2, \ldots, 20$,
Datasets | ACPred-Fuse's dataset | | ACPred-FL's dataset | |
---|---|---|---|---
| Positive | Negative | Positive | Negative
Training sets | 250 | 250 | 250 | 250
Test sets | 82 | 2628 | 82 | 82
where $n_i$ is the number of occurrences of amino acid $i$ in the sequence and $L$ is the sequence length.
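As a concrete illustration, AAC can be computed in a few lines of Python (a sketch with our own function names, not the authors' implementation):

```python
# Sketch: amino acid composition (AAC) as a 20-dimensional frequency vector.
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    """Return the frequency of each of the 20 amino acids in the sequence."""
    counts = Counter(seq.upper())
    L = len(seq)
    return [counts.get(aa, 0) / L for aa in AMINO_ACIDS]

# Example: aac("AAGC") gives 0.5 for A and 0.25 each for G and C, 0 elsewhere.
```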
DPC represents the frequency of each of the 400 possible dipeptide combinations in a given peptide sequence. It can be calculated as: $DPC(i, j) = n_{ij} / (L - 1)$,
where $n_{ij}$ is the number of occurrences of the dipeptide formed by amino acids $i$ and $j$, and $L$ is the sequence length.
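A minimal Python sketch of the DPC calculation (our own code, assuming the standard 400-dimensional dipeptide vocabulary):

```python
# Sketch: dipeptide composition (DPC) as a 400-dimensional frequency vector.
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 combinations

def dpc(seq):
    """Frequency of each dipeptide; a sequence of length L contains L-1 dipeptides."""
    seq = seq.upper()
    total = len(seq) - 1
    counts = {}
    for i in range(total):
        dp = seq[i:i + 2]
        counts[dp] = counts.get(dp, 0) + 1
    return [counts.get(dp, 0) / total for dp in DIPEPTIDES]
```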
According to their physicochemical properties, the 20 amino acids can be divided into five classes (
Physicochemical property | Amino acid |
---|---|
Aliphatic group (G1) | G, A, V, L, M, I |
Aromatic group (G2) | F, Y, W |
Positive charge group (G3) | K, R, H |
Negative charge group (G4) | D, E |
Uncharged group (G5) | S, T, C, P, N, Q |
The CKSAAGP is used to calculate the frequency of amino acid group pairs separated by any k residues. Using
Each value represents the number of times a residue group pair appears in the peptide sequence. For a peptide sequence of length $L$, there are $L - k - 1$ k-spaced residue pairs, and this total is used to normalize each count into a frequency.
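Using the five groups from the table above, the CKSAAGP feature can be sketched as follows (our own code; the choice of k up to 2 is an assumption for illustration):

```python
# Sketch: composition of k-spaced amino acid group pairs (CKSAAGP).
from itertools import product

GROUPS = {  # the five physicochemical classes from the table above
    "G1": set("GAVLMI"), "G2": set("FYW"), "G3": set("KRH"),
    "G4": set("DE"), "G5": set("STCPNQ"),
}
AA_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}
GROUP_PAIRS = ["".join(p) for p in product(GROUPS, repeat=2)]  # 25 group pairs

def cksaagp(seq, k_max=2):
    """For each k in 0..k_max, frequency of group pairs separated by k residues."""
    seq = seq.upper()
    features = []
    for k in range(k_max + 1):
        total = len(seq) - k - 1  # number of k-spaced pairs in the sequence
        counts = dict.fromkeys(GROUP_PAIRS, 0)
        for i in range(total):
            pair = AA_TO_GROUP[seq[i]] + AA_TO_GROUP[seq[i + k + 1]]
            counts[pair] += 1
        features.extend(counts[p] / total for p in GROUP_PAIRS)
    return features
```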
In order to improve the recognition of ACPs, we designed a new model, DLFF-ACP, based on a dual-channel DNN. We used the Keras framework (
In M1, the handcrafted features were input into a neural network with two fully connected layers containing 128 and 64 units, respectively. To reduce overfitting, a dropout rate of 0.2 was used between the two layers. During classification, the output of this module was fused with the CNN features by concatenation.
In M2, the encoded peptide sequences were input to automatically extract spatial features. Each encoded peptide sequence was a numeric vector of length 210. These vectors were fed into an embedding layer, which converted the discrete codes into fixed-size dense vectors. The embedding layer could express the relationships among the discrete codes; more importantly, its parameters were continually updated during training, making this representation increasingly accurate. We used a one-dimensional convolutional (Conv1D) layer with 32 kernels of size 16 to automatically extract features. Its output was fed into a max pooling layer with a kernel size of 8, which reduced the number of parameters and mitigated overfitting. The output of the max pooling layer was then passed to a fully connected layer containing 64 units.
Finally, the output of M1 was concatenated with the output of M2 and served as the input of module M3. This module consisted of a fully connected layer with 64 units and an output layer with one unit. The output layer used sigmoid as its activation function to produce a probability between 0 and 1; a value greater than 0.5 was classified as 'ACP', and otherwise as 'non-ACP'.
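The three modules described above can be sketched with the Keras functional API. The layer sizes follow the text; the embedding dimension (32) and the handcrafted-feature dimension (here 150) are our assumptions for illustration, not values confirmed by the paper:

```python
# Minimal Keras sketch of the dual-channel architecture (M1, M2, M3).
from tensorflow import keras
from tensorflow.keras import layers

def build_dlff_acp(seq_len=210, vocab_size=21, handcrafted_dim=150):
    # M1: handcrafted-feature channel, two dense layers (128 -> 64), dropout 0.2 between them
    hc_in = keras.Input(shape=(handcrafted_dim,))
    x1 = layers.Dense(128, activation="relu")(hc_in)
    x1 = layers.Dropout(0.2)(x1)
    x1 = layers.Dense(64, activation="relu")(x1)

    # M2: embedding -> Conv1D (32 kernels, size 16) -> max pooling (size 8) -> Dense(64)
    seq_in = keras.Input(shape=(seq_len,))
    x2 = layers.Embedding(vocab_size, 32)(seq_in)  # embedding dim 32 is an assumption
    x2 = layers.Conv1D(32, 16, activation="relu")(x2)
    x2 = layers.MaxPooling1D(8)(x2)
    x2 = layers.Flatten()(x2)
    x2 = layers.Dense(64, activation="relu")(x2)

    # M3: concatenate both channels, Dense(64), one sigmoid output unit
    merged = layers.concatenate([x1, x2])
    x = layers.Dense(64, activation="relu")(merged)
    out = layers.Dense(1, activation="sigmoid")(x)

    model = keras.Model(inputs=[hc_in, seq_in], outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

A predicted probability above 0.5 from the sigmoid output is then labeled 'ACP', as described above.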
To evaluate the performance, we adopted four metrics that are widely used in machine learning for two-class prediction problems: sensitivity (SE), specificity (SP), accuracy (ACC) and the Matthews correlation coefficient (MCC) (
where TP, TN, FP, and FN represent the number of true positives (i.e., ACPs classified correctly as ACPs), true negatives (i.e., non-ACPs classified correctly as non-ACPs), false positives (i.e., non-ACPs classified incorrectly as ACPs), and false negatives (i.e., ACPs classified incorrectly as non-ACPs), respectively. In order to better measure the classifier's overall performance, we also used the area under the curve (AUC) (
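The four metrics follow directly from the confusion-matrix counts and can be computed as below (a sketch; the function name is ours):

```python
# Sketch: SE, SP, ACC and MCC from confusion-matrix counts.
import math

def metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy and MCC."""
    se = tp / (tp + fn)                    # sensitivity: recall on ACPs
    sp = tn / (tn + fp)                    # specificity: recall on non-ACPs
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return se, sp, acc, mcc

# Example: metrics(40, 40, 10, 10) yields SE = SP = ACC = 0.8 and MCC = 0.6.
```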
In order to better understand the differences between ACPs and non-ACPs, we conducted different types of analyses on the total training and test datasets. The AAC analysis results
(A) AAC. (B) DPC. (C) CKSAAGP.
In our model, we used a CNN to extract the spatial features of a sequence. To find the best hidden-layer setting, we evaluated different numbers of convolutional layers and filters. Since our training set was small (500 samples), we considered at most two convolutional layers when selecting CNN architectures; deeper networks would introduce too many parameters and could cause the model to overfit. We chose three different numbers of filters: 32, 64, and 128.
In our model, the fused feature information used for classification came from M1 and M2. To better understand the individual performance of these two channels, we compared M1 and M2. In M1, we chose three handcrafted features for comparison. Since only a single channel was used, we removed the model's concatenation step; the output of the single channel was input directly into M3 to obtain the final prediction. All results were obtained through 10-fold cross-validation on the training set. The final comparison results are shown in
To select the most effective fusion of handcrafted features, we concatenated the CNN-extracted features with each of the three handcrafted features: AAC+CNN, CKSAAGP+CNN, and DPC+CNN. To verify whether the fused features benefited model performance, we compared each fused-feature model with the model built from the CNN channel alone (hereinafter, the CNN model), which achieved the best performance among the individual-feature models.
Filters | ACC | SE | SP | MCC | AUC |
---|---|---|---|---|---|
32 | |||||
64 | 0.80 | 0.79 | 0.79 | 0.60 | 0.88 |
32-64 | 0.78 | 0.76 | 0.80 | 0.57 | 0.87 |
64-128 | 0.78 | 0.78 | 0.78 | 0.56 | 0.87 |
The highest values are highlighted in bold.
Channel | Features | SE | SP | ACC | MCC |
---|---|---|---|---|---|
M1 | AAC | 0.75 | 0.77 | 0.76 | 0.53 |
CKSAAGP | 0.74 | 0.78 | 0.76 | 0.53 | |
DPC | 0.77 | 0.76 | 0.53 | ||
M2 | CNN feature |
The highest values are highlighted in bold.
To verify the performance of our proposed model, we compared it with six traditional machine learning methods: AntiCP (
Feature group | SE | SP | ACC | MCC |
---|---|---|---|---|
CNN+AAC | 0.74 | 0.79 | 0.59 | |
CNN+CKSAAGP | 0.83 | |||
CNN+DPC | 0.80 | 0.79 | 0.80 | 0.60 |
CNN | 0.76 | 0.80 | 0.78 | 0.57 |
The highest values are highlighted in bold.
(A) Distribution of the CKSAAGP feature. (B) Distribution of the Encoding. (C) Distribution of the fusion of handcrafted feature channel and CNN channel.
Methods | SE | SP | ACC | MCC | AUC |
---|---|---|---|---|---|
AntiCP_ACC | 0.68 | 0.89 | 0.88 | 0.29 | 0.85 |
AntiCP_DC | 0.68 | 0.83 | 0.82 | 0.22 | 0.83 |
Hajisharifi’s method | 0.70 | 0.88 | 0.88 | 0.29 | 0.86 |
iACP | 0.55 | 0.89 | 0.88 | 0.23 | 0.76 |
ACPred-FL | 0.70 | 0.86 | 0.85 | 0.26 | 0.85 |
ACPred-Fuse | 0.72 | 0.87 | |||
DeepACP | 0.78 | 0.86 | 0.86 | 0.31 | 0.88 |
ACP-MHCNN | 0.78 | 0.79 | 0.79 | 0.23 | 0.85 |
DLFF-ACP | 0.86 | 0.86 |
The highest values are highlighted in bold.
Methods | SE | SP | ACC | MCC | AUC |
---|---|---|---|---|---|
AntiCP_ACC | 0.68 | 0.87 | 0.77 | 0.56 | 0.83 |
AntiCP_DC | 0.74 | 0.84 | 0.79 | 0.59 | 0.84 |
Hajisharifi’s method | 0.67 | 0.87 | 0.77 | 0.55 | 0.82 |
iACP | 0.68 | 0.80 | 0.74 | 0.49 | 0.80 |
ACPred-FL | 0.81 | 0.88 | 0.78 | ||
DeepACP | 0.89 | 0.77 | 0.83 | 0.66 | 0.87 |
ACP-MHCNN | 0.84 | 0.93 | |||
DLFF-ACP | 0.88 | 0.87 | 0.87 | 0.74 | 0.91 |
The highest values are highlighted in bold.
In this study, we developed a new model for predicting ACPs based on deep learning and multi-view feature fusion. We integrated the handcrafted features into a deep learning framework and predicted the peptide class using a fully connected neural network. Different types of features convey different sequence information: CNN features focus on spatial information, while handcrafted features provide sequence composition information or physicochemical properties. We compared the single-channel models, and the results showed that the CNN features performed better. Additionally, we compared fusions of the different handcrafted features with the CNN features, and found that fusing CKSAAGP features with CNN features achieved the best performance. Fusing these features enriched the final representation and improved the performance of the model. We also compared the dual-channel model with the single-channel models, and the results showed that the dual-channel model achieved better performance, which validated our hypothesis. To verify the robustness of our model, we compared the performance of various models on the test set. In this test set, the number of negative samples was greater than the number of positive samples, making it closer to real data in practical applications, and the results showed that our model performed better. To test the model's generalization, we added a balanced test set with equal numbers of positive and negative samples, derived from ACPred-FL. The results showed that our model performed better than most models.
ACPs show many strengths in the treatment of cancer, but identifying ACPs with existing experimental methods is time-consuming and laborious. In this study, we proposed a fast and efficient ACP prediction model based on a dual-channel deep learning ensemble method. By fusing handcrafted features with features extracted by a CNN, our model can effectively predict ACPs. Various comparative experiments confirmed the model's excellent performance. In conclusion, the proposed predictor is effective and promising for ACP identification and can serve as an alternative tool for predicting ACPs, especially on independent test sets that contain more negative samples. In future research, we will use different network architectures, such as generative adversarial networks, to find latent features. Additionally, methods that have been successful in natural language processing may also be considered.
The authors acknowledge the High-performance Computing Platform of Anhui University for providing computing resources.
The authors declare there are no competing interests.
The following information was supplied regarding data availability:
The data and code are available at GitHub: