Tutorials
Material and Method: Environmental pollution is one of the most important problems facing people all over the world. It is necessary to assess the harmful effects, or toxicities, of the chemicals to which people are exposed. To study chemical toxicity, we used the BioChem tool from the BioTriangle web server (http://biotriangle.scbdd.com) to calculate chemical descriptors and then applied the random forest (RF) method to build a chemical toxicity classification model.
The benchmark data sets for building the classification model of toxic and non-toxic chemicals were taken from (Cao, Dong et al. 2015). There are five data sets: EPA water disinfection by-products with carcinogenicity estimates (DBPCAN), estrogen receptor binding from the FDA National Center for Toxicological Research (NCTRER), EPA fathead minnow acute toxicity (EPAFHM), the carcinogenic potency database (CPDBAS), and maximum recommended daily dose from the FDA Center for Drug Evaluation and Research (FDAMDD). Molecules whose descriptors could not be calculated in BioChem were removed from the data sets. We established two groups of models, using molecular descriptors and molecular fingerprints respectively. In the first group, each chemical was represented by 369 descriptors (Constitution (30), Topology (35), Kappa (7), EState (237) and MOE-type (60)) calculated with the BioChem tool. The calculated descriptors were saved as CSV files. In the second group, five kinds of molecular fingerprints (Daylight, ECFP4, Estate, FP2 and MACCS) were used to represent each chemical.
To build the classification model, the CSV file containing the calculated descriptors was converted to a sample matrix (x_train), and a sample label vector (y_train) was provided. The Python script randomforests.py, based on the sklearn package, was then used to build the classification model (500 trees; the maximum number of features considered at each split is the square root of the number of features). The performance of the model was evaluated by 10-fold cross-validation. The AUC score, accuracy, sensitivity and specificity are given in Table 1 and Table 2.
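The modelling step described above can be sketched with scikit-learn as follows. This is a minimal sketch, not the contents of randomforests.py: the function name `evaluate_rf`, the synthetic data from `make_classification`, the stratified shuffled folds, the 0.5 decision threshold and the fixed random seed are all assumptions made for illustration. In scikit-learn, `max_features="sqrt"` limits the number of candidate features at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def evaluate_rf(x_train, y_train, n_trees=500, n_folds=10, seed=0):
    """Cross-validate a random forest and return AUC, ACC, SEN and SPE."""
    model = RandomForestClassifier(
        n_estimators=n_trees,   # 500 trees, as stated in the text
        max_features="sqrt",    # sqrt(n_features) candidate features per split
        random_state=seed,      # assumption: fixed seed for reproducibility
    )
    # Assumption: stratified, shuffled folds; the text only says "10-fold".
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    # Out-of-fold probability of the positive class (label 1) for every sample.
    proba = cross_val_predict(
        model, x_train, y_train, cv=folds, method="predict_proba"
    )[:, 1]
    pred = (proba >= 0.5).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_train, pred, labels=[0, 1]).ravel()
    return {
        "AUC": roc_auc_score(y_train, proba),
        "ACC": accuracy_score(y_train, pred),
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
    }


# Synthetic stand-in for the descriptor matrix and the label vector.
x_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
scores = evaluate_rf(x_demo, y_demo, n_trees=50)  # fewer trees to keep the demo fast
print({name: round(float(value), 2) for name, value in scores.items()})
```

With real data, `x_demo` and `y_demo` would be replaced by the matrix and labels derived from the BioChem CSV file, and `n_trees` left at its default of 500.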
Tools: BioChem
Download: CHEMICAL.zip RandomForests.py.zip
Material and Method: Identifying the functions of proteins in an organism is one of the fundamental goals of cell biology and proteomics. The function of a protein is closely linked to its location in the cell. Determining protein subcellular location (PSL) by experimental methods is expensive and time-consuming. As data repositories grow, automatic prediction of PSL offers an alternative that facilitates its determination. To build a PSL prediction model, we used the BioProtein tool from the BioTriangle web server (http://biotriangle.scbdd.com) to calculate protein features and then applied the random forest (RF) method to build a PSL classification model.
The benchmark data set for building the protein subcellular location predictor was taken from (Jia, Qian et al. 2007). The data set contains 2468 samples: 849 proteins located in the cytoplasm, defined as the positive data set, and 1619 proteins located in the nucleus, defined as the negative data set. For each protein, 20 amino acid composition (AAC) features, 147 composition, transition and distribution (CTD) features and 30 pseudo amino acid composition (PAAC) features, 197 features in total, were calculated with the BioProtein tool. The calculated features were saved as CSV files.
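To make the simplest of these feature groups concrete, the sketch below computes the 20 amino acid composition values for one sequence. It follows the usual definition of AAC as the fraction of each standard residue. The function name, the toy sequence and the choice to ignore non-standard residue codes are assumptions, and the text does not state whether BioProtein reports fractions or percentages.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence."""
    sequence = sequence.upper()
    # Count only standard residues so that ambiguous codes such as X are ignored.
    counts = {aa: sequence.count(aa) for aa in AMINO_ACIDS}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


aac = amino_acid_composition("MKVLAAGIVK")  # hypothetical toy sequence
print(len(aac), round(sum(aac), 6))
```

The CTD and PAAC groups are more involved and are not reproduced here.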
To build the classification model, the CSV file containing the calculated features was converted to a sample matrix (x_train), and a sample label vector (y_train) was provided. The Python script randomforests.py, based on the sklearn package, was then used to build the classification model (500 trees; the maximum number of features considered at each split is the square root of the number of features). The performance of the model was evaluated by 10-fold cross-validation. The AUC score, accuracy, sensitivity and specificity are 0.90, 0.85, 0.94 and 0.69, respectively (Table 3).
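The conversion from a feature CSV file to a sample matrix and a label vector might look like the sketch below. The layout of the real files is not described in the text, so the column names, the presence of a `label` column inside the CSV and the helper name `load_features` are all hypothetical.

```python
import csv
import io


def load_features(csv_file, label_column="label"):
    """Split a feature CSV into a sample matrix and a label vector."""
    reader = csv.DictReader(csv_file)
    x_train, y_train = [], []
    for row in reader:
        y_train.append(int(row.pop(label_column)))
        # Remaining columns are treated as numeric features, in header order.
        x_train.append([float(value) for value in row.values()])
    return x_train, y_train


# Hypothetical two-feature file; a real feature file would have many more columns.
demo_csv = io.StringIO("AAC_A,AAC_C,label\n0.10,0.02,1\n0.07,0.05,0\n")
x_train, y_train = load_features(demo_csv)
print(x_train, y_train)
```

If the labels are supplied separately, as the text suggests, only the feature-parsing part of the loop would be needed.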
Tools: BioProt
Download: protein.zip
Material and Method: Meiotic recombination is one of the most important processes in biology. Genomic regions where meiotic recombination occurs at higher frequencies are defined as hotspots, and regions where it occurs at lower frequencies as coldspots. Because determining recombination regions is expensive and time-consuming, predicting hotspots and coldspots provides a rapid and effective way to identify recombination regions automatically. To this end, we used the BioDNA tool from the BioTriangle web server (http://biotriangle.scbdd.com) to calculate DNA features and then applied the random forest (RF) method to build the recombination region classification model.
The benchmark data set for building the recombination hotspot and coldspot classification model was taken from (Liu, Liu et al. 2015). The file ‘hotspots.fasta’, containing 490 recombination hotspots, is used as the positive data set, and the file ‘coldspots.fasta’, containing 591 recombination coldspots, as the negative data set. To represent each DNA sequence, 64 basic k-mer features were calculated with the BioDNA tool. The calculated features were saved as CSV files.
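A count of 64 features is consistent with k = 3, since there are 4^3 = 64 possible trinucleotides; this value of k is an inference, not something the text states. Under that assumption, a minimal sketch of a k-mer feature vector is given below. The function name, the normalization to frequencies and the handling of non-ACGT symbols are illustrative choices and may differ from what BioDNA does.

```python
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return normalized k-mer frequencies in lexicographic order (A, C, G, T)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    sequence = sequence.upper()
    # Slide a window of length k over the sequence, one base at a time.
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # skip windows containing N or other symbols
            counts[window] += 1
    total = sum(counts.values())
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


features = kmer_frequencies("ACGTACGTAC")  # hypothetical toy sequence
print(len(features))
```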
To build the classification model, the CSV file containing the calculated features was converted to a sample matrix (x_train), and a sample label vector (y_train) was provided. The Python script randomforests.py, based on the sklearn package, was then used to build the classification model (500 trees; the maximum number of features considered at each split is the square root of the number of features). The performance of the model was evaluated by 10-fold cross-validation. The AUC score, accuracy, sensitivity and specificity are 0.89, 0.82, 0.75 and 0.88, respectively (Table 3).
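For readers unfamiliar with 10-fold cross-validation, the sketch below shows how the 1081 sequences (490 hotspots plus 591 coldspots) could be partitioned so that each sample is held out exactly once. It is a plain shuffled split written for illustration; whether the original script shuffled or stratified the folds is not stated, and the function name and seed are assumptions.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so that fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 1081 = 490 hotspots + 591 coldspots, the sample counts given in the text.
splits = list(k_fold_indices(1081))
print(len(splits), len(splits[0][1]))
```

In each round the model is trained on the nine training folds and scored on the held-out fold, and the reported metrics summarize the held-out predictions.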
Tools: BioDNA
Download: DNA.zip
Material and Method: Drug-target interactions (DTIs) are central to current drug discovery processes. The rapidly increasing amount of publicly available data in biology and chemistry enables researchers to revisit drug-target interaction problems through systematic integration and analysis of heterogeneous data.
Identifying the interactions between drugs and targets is important in drug discovery today. Interaction with ligands can modulate the function of many targets in processes such as signal transport and catalytic reactions.
As data repositories grow, automatic prediction of drug-protein interactions offers an alternative that facilitates drug discovery. Our previous work (Cao et al, 2014) showed that the calculated features perform well in predicting chemical-protein interactions. The benchmark data set for building the drug-target interaction predictor was taken from (Yamanishi, Araki et al. 2008). The data set contains 5844 samples: 2922 drug-protein pairs that interact, defined as the positive data set, and 2922 drug-protein pairs that do not interact, defined as the negative data set. Each drug-protein pair was represented by 166 MACCS molecular fingerprint bits for the drug and, for the protein, 20 amino acid composition (AAC) features and 147 composition, transition and distribution (CTD) features, 333 features in total.
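Representing a pair by joining the drug vector and the protein vector can be sketched as below. The identifiers, the shortened toy vectors and the function name `build_pair_dataset` are hypothetical; real vectors would hold the MACCS bits for the drug followed by the AAC and CTD values for the protein.

```python
def build_pair_dataset(pairs, drug_features, protein_features):
    """Build X and y by concatenating drug and protein descriptors per pair."""
    x_train, y_train = [], []
    for drug_id, protein_id, label in pairs:
        # Drug features first, then protein features, in a fixed order.
        x_train.append(list(drug_features[drug_id]) + list(protein_features[protein_id]))
        y_train.append(label)
    return x_train, y_train


# Hypothetical identifiers and shortened vectors, for illustration only.
drug_features = {"D1": [1, 0, 1], "D2": [0, 0, 1]}
protein_features = {"P1": [0.2, 0.8], "P2": [0.5, 0.5]}
pairs = [("D1", "P1", 1), ("D2", "P2", 1), ("D1", "P2", 0)]
x_train, y_train = build_pair_dataset(pairs, drug_features, protein_features)
print(x_train[0], y_train)
```

Because the same drug or protein can appear in several pairs, computing each descriptor once and looking it up per pair avoids repeated work.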
The random forest (RF) classifier was used to build the model. The AUC score, accuracy, sensitivity and specificity are 0.94, 0.86, 0.85 and 0.86, respectively (Table 3).
Tools: BioCPI
Download: CPI.zip
Material and Method: Systems biology is becoming increasingly important for discovering new properties of biological systems. One major aim of systems biology is to understand how the various components of biological systems combine to produce these properties. Non-coding RNAs (ncRNAs) are found in all analyzed organisms, and regulatory processes involving ncRNA molecules are very common. Most post-transcriptional events are mediated by the association of RNAs with specific proteins or macromolecular protein complexes. It is therefore important to predict functional interactions between ncRNAs and proteins. Here, the BioDPI tool from the BioTriangle web server (http://biotriangle.scbdd.com) was used to calculate RNA and protein features, and the random forest (RF) machine learning method was then applied to build pattern recognition models.
The benchmark data set for building the protein-RNA interaction classification model was taken from (Wu, Wang et al. 2006). The data set contains 22382 samples: 11191 interactions confirmed as active in vivo or in vitro, defined as the positive data set, and 11191 inactive interactions, defined as the negative data set. To represent each sample, AAC, CTD, PAAC and DAC descriptors, 426 features in total, were calculated with our web server. The calculated features were saved as CSV files. We then used random forest (RF) classifiers to build the model.
To build the classification model, the CSV file containing the calculated features was converted to a sample matrix (x_train), and a sample label vector (y_train) was provided. The Python script randomforests.py, based on the sklearn package, was then used to build the classification model (500 trees; the maximum number of features considered at each split is the square root of the number of features). The performance of the model was evaluated by 10-fold cross-validation. The AUC score, accuracy, sensitivity and specificity are 0.97, 0.92, 0.92 and 0.91, respectively (Table 3).
Tools: BioDPI
Download: RNA-PI.zip
Summary:
Table 1. Summary of five datasets

| Data set | P | N | SEN | SPE | ACC | MCC | f1_score | AUC_score |
|---|---|---|---|---|---|---|---|---|
| CPDBAS | 318 | 346 | 0.75 | 0.8 | 0.77 | 0.55 | 0.76 | 0.87 |
| DBPCAN | 76 | 95 | 0.88 | 0.93 | 0.91 | 0.81 | 0.89 | 0.97 |
| EPAFHM | 300 | 287 | 0.75 | 0.82 | 0.78 | 0.57 | 0.78 | 0.86 |
| FDAMDD | 361 | 441 | 0.81 | 0.81 | 0.81 | 0.62 | 0.8 | 0.9 |
| NCTRER | 130 | 92 | 0.88 | 0.84 | 0.86 | 0.71 | 0.88 | 0.93 |
Table 2. Summary of five datasets

| Fingerprint type | Data set | P | N | SEN | SPE | ACC | MCC | f1_score | AUC_score |
|---|---|---|---|---|---|---|---|---|---|
| Daylight | CPDBAS | 319 | 346 | 0.72 | 0.82 | 0.77 | 0.54 | 0.75 | 0.83 |
| Daylight | DBPCAN | 77 | 95 | 0.84 | 0.88 | 0.87 | 0.73 | 0.85 | 0.94 |
| Daylight | EPAFHM | 301 | 287 | 0.69 | 0.68 | 0.69 | 0.37 | 0.69 | 0.76 |
| Daylight | FDAMDD | 361 | 441 | 0.79 | 0.82 | 0.8 | 0.61 | 0.78 | 0.89 |
| Daylight | NCTRER | 130 | 92 | 0.84 | 0.75 | 0.8 | 0.59 | 0.83 | 0.87 |
| ECFP4 | CPDBAS | 319 | 346 | 0.73 | 0.84 | 0.79 | 0.58 | 0.77 | 0.86 |
| ECFP4 | DBPCAN | 77 | 95 | 0.87 | 0.87 | 0.87 | 0.74 | 0.86 | 0.96 |
| ECFP4 | EPAFHM | 301 | 287 | 0.72 | 0.7 | 0.71 | 0.42 | 0.72 | 0.78 |
| ECFP4 | FDAMDD | 361 | 441 | 0.82 | 0.83 | 0.83 | 0.66 | 0.81 | 0.89 |
| ECFP4 | NCTRER | 130 | 92 | 0.88 | 0.87 | 0.88 | 0.75 | 0.89 | 0.94 |
| Estate | CPDBAS | 319 | 346 | 0.73 | 0.75 | 0.74 | 0.48 | 0.73 | 0.81 |
| Estate | DBPCAN | 77 | 95 | 0.83 | 0.94 | 0.89 | 0.78 | 0.87 | 0.94 |
| Estate | EPAFHM | 301 | 287 | 0.68 | 0.7 | 0.69 | 0.38 | 0.69 | 0.74 |
| Estate | FDAMDD | 361 | 441 | 0.71 | 0.8 | 0.76 | 0.52 | 0.73 | 0.83 |
| Estate | NCTRER | 130 | 92 | 0.86 | 0.7 | 0.79 | 0.57 | 0.83 | 0.82 |
| FP2 | CPDBAS | 319 | 346 | 0.73 | 0.8 | 0.77 | 0.53 | 0.75 | 0.83 |
| FP2 | DBPCAN | 77 | 95 | 0.88 | 0.94 | 0.91 | 0.82 | 0.9 | 0.96 |
| FP2 | EPAFHM | 301 | 287 | 0.71 | 0.68 | 0.69 | 0.39 | 0.7 | 0.78 |
| FP2 | FDAMDD | 361 | 442 | 0.78 | 0.82 | 0.8 | 0.6 | 0.78 | 0.89 |
| FP2 | NCTRER | 130 | 92 | 0.88 | 0.78 | 0.84 | 0.66 | 0.86 | 0.9 |
| MACCS | CPDBAS | 319 | 346 | 0.76 | 0.83 | 0.8 | 0.6 | 0.78 | 0.87 |
| MACCS | DBPCAN | 77 | 95 | 0.9 | 0.93 | 0.91 | 0.82 | 0.9 | 0.97 |
| MACCS | EPAFHM | 301 | 287 | 0.72 | 0.77 | 0.75 | 0.5 | 0.75 | 0.8 |
| MACCS | FDAMDD | 361 | 441 | 0.78 | 0.82 | 0.8 | 0.6 | 0.78 | 0.88 |
| MACCS | NCTRER | 130 | 92 | 0.83 | 0.83 | 0.83 | 0.65 | 0.85 | 0.9 |
Table 3. Summary of last four APPs

| Data set | P | N | SEN | SPE | ACC | AUC_score |
|---|---|---|---|---|---|---|
| APP2 | 849 | 1619 | 0.94 | 0.69 | 0.85 | 0.9 |
| APP3 | 490 | 591 | 0.75 | 0.88 | 0.82 | 0.89 |
| APP4 | 2922 | 2922 | 0.85 | 0.86 | 0.86 | 0.94 |
| APP5 | 11191 | 11191 | 0.91 | 0.92 | 0.92 | 0.97 |
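The column abbreviations in the tables are standard binary-classification metrics: sensitivity (SEN), specificity (SPE), accuracy (ACC), Matthews correlation coefficient (MCC) and F1 score, with P and N the numbers of positive and negative samples. A minimal sketch of how they follow from confusion-matrix counts is shown below, using the textbook definitions. The function name and the example counts are hypothetical and are not taken from any of the tables.

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp):
    """Compute SEN, SPE, ACC, MCC and F1 from confusion-matrix counts."""
    sen = tp / (tp + fn)                      # true positive rate, P = tp + fn
    spe = tn / (tn + fp)                      # true negative rate, N = tn + fp
    acc = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sen / (precision + sen)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"SEN": sen, "SPE": spe, "ACC": acc, "MCC": mcc, "f1_score": f1}


# Hypothetical counts, not taken from any of the tables.
metrics = classification_metrics(tp=80, fn=20, tn=90, fp=10)
print({name: round(value, 2) for name, value in metrics.items()})
```

The AUC score is not derived from a single confusion matrix; it is computed from the ranking of predicted probabilities over all thresholds.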
1. Cao, D.-S., J. Dong, et al. (2015). "In silico toxicity prediction of chemicals from EPA toxicity database by kernel fusion-based support vector machines." Chemometrics and Intelligent Laboratory Systems 146: 494-502.
2. Jia, P., Z. Qian, et al. (2007). "Prediction of subcellular protein localization based on functional domain composition." Biochemical and Biophysical Research Communications 357(2): 366-370.
3. Liu, B., F. Liu, et al. (2015). "repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects." Bioinformatics 31(8): 1307-1309.
4. Wu, T., J. Wang, et al. (2006). "NPInter: the noncoding RNAs and protein related biomacromolecules interaction database." Nucleic Acids Research 34(suppl 1): D150-D152.
5. Yamanishi, Y., M. Araki, et al. (2008). "Prediction of drug–target interaction networks from the integration of chemical and genomic spaces." Bioinformatics 24(13): i232-i240.