Tutorials

  1. Identifying toxicities of chemicals with MOE-type descriptors. (APP1)

    Material and Method: Environmental pollution is a problem of worldwide concern, and it is necessary to assess the harmful effects, or toxicities, of the chemicals to which humans are exposed. To study chemical toxicity, we used the BioChem tool from the BioTriangle web server (http://biotriangle.scbdd.com) to calculate chemical descriptors and then applied the random forest (RF) method to build a chemical toxicity classification model.

    The benchmark data sets for building the toxic versus non-toxic chemical classification model were taken from (Cao, Dong et al. 2015). There are five data sets: EPA water disinfection by-products with carcinogenicity estimates (DBPCAN), estrogen receptor binding from the FDA National Center for Toxicological Research (NCTRER), EPA fathead minnow acute toxicity (EPAFHM), the carcinogenic potency database (CPDBAS), and maximum recommended daily dose from the FDA Center for Drug Evaluation and Research (FDAMDD). Molecules whose descriptors could not be calculated in BioChem were removed from the data sets. We established two groups of models, one using molecular descriptors and one using fingerprints. In the first group, each chemical was represented by 369 descriptors (Constitution (30), Topology (35), Kappa (7), EState (237) and MOE-type (60)) calculated with the BioChem tool. The calculated descriptors were saved as CSV files. In the second group, five kinds of molecular fingerprints (Daylight, ECFP4, Estate, FP2 and MACCS) were used to represent each chemical.

    To build the classification model, the CSV file containing the calculated descriptors was converted to a sample matrix (x_train), and a sample label vector (y_train) was also provided. The Python script randomforests.py, based on the sklearn package, was then used to build the classification model (500 trees; the maximum number of features for each tree is the square root of the number of features). The performance of the model was evaluated by 10-fold cross-validation. The AUC score, accuracy, sensitivity and specificity are reported in Table 1 and Table 2.
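
    The settings described above can be sketched with scikit-learn as follows. This is not the randomforests.py script from the download, only a minimal illustration under stated assumptions: the function name evaluate_random_forest is hypothetical, labels are assumed to be 1 (positive) and 0 (negative), the folds are assumed to be stratified and shuffled, and a probability threshold of 0.5 is assumed for the class decision.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def evaluate_random_forest(x_train, y_train, n_trees=500, n_folds=10, seed=0):
    """Cross-validate a random forest and return AUC, ACC, SEN and SPE.

    Assumes binary labels in y_train: 1 = positive, 0 = negative.
    """
    model = RandomForestClassifier(
        n_estimators=n_trees,   # number of trees (500 in the text)
        max_features="sqrt",    # square root of the number of features
        random_state=seed,
    )
    # Assumption: stratified, shuffled folds; the text only says "10-fold".
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    # Out-of-fold probability of the positive class for every sample.
    prob = cross_val_predict(
        model, x_train, y_train, cv=folds, method="predict_proba"
    )[:, 1]
    pred = (prob >= 0.5).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_train, pred, labels=[0, 1]).ravel()
    return {
        "AUC": roc_auc_score(y_train, prob),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
    }
```

    Calling evaluate_random_forest(x_train, y_train) with the matrices built from a descriptor CSV file returns the four scores as a dictionary. The exact values depend on the random seed and on how the folds are drawn, so they need not match the tables to the last digit.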

    Tools: BioChem
    Download: CHEMICAL.zip RandomForests.py.zip

  2. Prediction of protein subcellular location. (APP2)

    Material and Method: Identifying the functions of proteins in an organism is one of the fundamental goals of cell biology and proteomics. The function of a protein is closely linked to its location in the cell. Determining protein subcellular location (PSL) by experimental methods is expensive and time-consuming. With the growth of data repositories, automatic prediction of PSL offers an alternative that facilitates its determination. To build a PSL prediction model, we used the BioProtein tool from the BioTriangle web server (http://biotriangle.scbdd.com) to calculate protein features and then applied the random forest (RF) method to build a PSL classification model.

    The benchmark data set for building the protein subcellular location predictor was taken from (Jia, Qian et al. 2007). The data set contains 2468 samples: 849 proteins located in the cytoplasm, defined as the positive set, and 1619 proteins located in the nucleus, defined as the negative set. For each protein, 20 amino acid composition (AAC) features, 147 composition, transition and distribution (CTD) features and 30 pseudo amino acid composition (PAAC) features, 197 features in total, were calculated with the BioProtein tool. The calculated features were saved as CSV files.
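
    To illustrate the simplest of these feature groups, amino acid composition is the fraction of each of the 20 standard amino acids in a sequence. The sketch below is an assumption about the general idea only: the function name is hypothetical, non-standard residue symbols are simply skipped, and the BioProtein tool may scale or order its AAC output differently.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence."""
    # Assumption: symbols outside the 20 standard residues are ignored.
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: residues.count(aa) / len(residues) for aa in AMINO_ACIDS}


# Made-up five-residue sequence: two of the five residues are lysine (K).
aac = amino_acid_composition("MKKLA")
print(len(aac), aac["K"])  # 20 0.4
```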

    To build the classification model, the CSV file containing the calculated descriptors was converted to a sample matrix (x_train), and a sample label vector (y_train) was also provided. The Python script randomforests.py, based on the sklearn package, was then used to build the classification model (500 trees; the maximum number of features for each tree is the square root of the number of features). The performance of the model was evaluated by 10-fold cross-validation. The AUC score, accuracy, sensitivity and specificity are 0.90, 0.85, 0.94 and 0.69, respectively (Table 3).
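
    The conversion from a CSV file to x_train and y_train can be sketched as follows. The column layout is an assumption: the sketch supposes a header row, one row per sample, numeric feature columns, and a hypothetical column named "label" holding 1 for positive and 0 for negative samples. The files in the download may be organized differently.

```python
import csv
import io

import numpy as np


def load_feature_csv(handle, label_column="label"):
    """Split a feature CSV into a sample matrix and a label vector."""
    reader = csv.DictReader(handle)
    # Every column except the (hypothetical) label column is a feature.
    feature_names = [name for name in reader.fieldnames if name != label_column]
    rows, labels = [], []
    for record in reader:
        rows.append([float(record[name]) for name in feature_names])
        labels.append(int(record[label_column]))
    x_train = np.array(rows, dtype=float)
    y_train = np.array(labels, dtype=int)
    return x_train, y_train, feature_names


# Tiny in-memory file standing in for a real feature CSV.
demo = io.StringIO("f1,f2,label\n0.1,0.2,1\n0.3,0.4,0\n")
x_train, y_train, names = load_feature_csv(demo)
print(x_train.shape, y_train.tolist(), names)  # (2, 2) [1, 0] ['f1', 'f2']
```

    For a file on disk, the same function can be given an open file handle, for example one obtained with open(path, newline="").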

    Tools: BioProt
    Download: protein.zip

  3. Identifying recombination spots with basic k-mer features. (APP3)

    Material and Method: Meiotic recombination is one of the most important processes in biology. Genomic regions where meiotic recombination occurs at higher frequencies are defined as hotspots, and regions where it occurs at lower frequencies as coldspots. Because experimental determination of recombination regions is expensive and time-consuming, computational prediction of hotspots and coldspots provides a rapid and effective way to identify them. To predict recombination regions automatically, we used the BioDNA tool from the BioTriangle web server (http://biotriangle.scbdd.com) to calculate DNA features and then applied the random forest (RF) method to build the recombination region classification model.

    The benchmark data set for building the recombination hotspot and coldspot classification model was taken from (Liu, Liu et al. 2015). The file ‘hotspots.fasta’, containing 490 recombination hotspots, is the positive set, and the file ‘coldspots.fasta’, containing 591 recombination coldspots, is the negative set. To represent each DNA sequence, 64 basic k-mer features were chosen and calculated with the BioDNA tool. The calculated features were saved as CSV files.
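
    A count of 64 features is what a single k-mer size of k = 3 gives over the four-letter DNA alphabet (4^3 = 64). The sketch below shows one common way to compute such features, as normalized counts over a sliding window. The function name is hypothetical, and whether BioDNA normalizes the counts or handles ambiguous bases in the same way is an assumption.

```python
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return the frequency of every k-mer over A, C, G, T in a DNA sequence."""
    seq = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]  # fixed order
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of length k along the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # assumption: windows with N or other symbols are skipped
            counts[word] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence is shorter than k or has no valid k-mers")
    return [counts[w] / total for w in kmers]


features = kmer_frequencies("ACGTACGTAC")  # made-up sequence
print(len(features))  # 64
```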

    To build the classification model, the CSV file containing the calculated descriptors was converted to a sample matrix (x_train), and a sample label vector (y_train) was also provided. The Python script randomforests.py, based on the sklearn package, was then used to build the classification model (500 trees; the maximum number of features for each tree is the square root of the number of features). The performance of the model was evaluated by 10-fold cross-validation. The AUC score, accuracy, sensitivity and specificity are 0.89, 0.82, 0.75 and 0.88, respectively (Table 3).

    Tools: BioDNA
    Download: DNA.zip

  4. Prediction of drug–target interaction from the integration of chemical and protein spaces. (APP4)

    Material and Method: Drug-target interactions (DTIs) are central to current drug discovery processes. The rapidly increasing amount of publicly available data in biology and chemistry enables researchers to revisit drug-target interaction problems by systematic integration and analysis of heterogeneous data.

    Identifying the interactions between drugs and targets is important in drug discovery today. Interaction with ligands can modulate the function of many targets in processes such as signal transport and catalytic reactions. With the growth of data repositories, automatic prediction of drug-target interactions offers an alternative that facilitates drug discovery. Our previous work (Cao et al. 2014) showed that the calculated features perform well in the prediction of chemical-protein interactions. The benchmark data set for building the drug-target interaction predictor was taken from (Yamanishi, Araki et al. 2008). The data set contains 5844 samples: 2922 drug-protein pairs that interact, defined as the positive set, and 2922 drug-protein pairs that do not interact, defined as the negative set. Each drug-protein pair was represented by 166 MACCS molecular fingerprint bits for the drug and, for the protein, 20 amino acid composition (AAC) features and 147 composition, transition and distribution (CTD) features, 333 features in total. The random forest (RF) classifier was used to build the model. The AUC score, accuracy, sensitivity and specificity are 0.94, 0.86, 0.85 and 0.86, respectively (Table 3).
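
    One plausible way to represent a drug-protein pair is to join the drug fingerprint and the protein descriptors end to end into a single feature vector. The sketch below assumes this simple concatenation. The function name and the toy values are hypothetical, and the BioCPI tool may combine the two feature sets differently.

```python
import numpy as np


def pair_vector(drug_bits, protein_features):
    """Join a drug fingerprint and protein descriptors into one sample vector."""
    drug = np.asarray(drug_bits, dtype=float).ravel()
    protein = np.asarray(protein_features, dtype=float).ravel()
    # Assumption: the pair is represented by plain concatenation.
    return np.concatenate([drug, protein])


# Toy inputs: a 4-bit fingerprint and 3 protein descriptors (made-up values).
sample = pair_vector([1, 0, 0, 1], [0.25, 0.50, 0.25])
print(sample.shape)  # (7,)
```

    Stacking one such vector per drug-protein pair gives the sample matrix passed to the classifier.
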
    Tools: BioCPI
    Download: CPI.zip

  5. Prediction of RNA–protein interaction from the integration of RNA and protein spaces. (APP5)

    Material and Method: Systems biology is becoming increasingly important for discovering new properties of biological systems. One major aim of systems biology is to understand how the various components of biological systems combine to produce these properties. Non-coding RNAs (ncRNAs) are found in all analyzed organisms, and regulatory processes involving ncRNA molecules are very common. Most post-transcriptional events are mediated by the association of RNAs with specific proteins or macromolecular protein complexes. It is therefore very important to predict functional interactions between ncRNAs and proteins. Here the BioDPI tool from the BioTriangle web server (http://biotriangle.scbdd.com) was used to calculate RNA and protein features, and the random forest (RF) machine learning method was then applied to build pattern recognition models.

    The benchmark data set for building the protein-RNA interaction classification model was taken from (Wu, Wang et al. 2006). The data set contains 22382 samples: 11191 interactions confirmed as active in vivo or in vitro, defined as the positive set, and 11191 inactive interactions, defined as the negative set. To represent each sample, AAC, CTD, PAAC and DAC descriptors, 426 features in total, were calculated with our web server. The calculated features were saved as CSV files. We then used random forest (RF) classifiers to build the model.

    To build the classification model, the CSV file containing the calculated descriptors was converted to a sample matrix (x_train), and a sample label vector (y_train) was also provided. The Python script randomforests.py, based on the sklearn package, was then used to build the classification model (500 trees; the maximum number of features for each tree is the square root of the number of features). The performance of the model was evaluated by 10-fold cross-validation. The AUC score, accuracy, sensitivity and specificity are 0.97, 0.92, 0.92 and 0.91, respectively (Table 3).
    Tools: BioDPI
    Download: RNA-PI.zip

  6. Summary of results for each example.

    Summary:

    Table 1. Summary of five datasets

    Data set   P     N     SEN    SPE    ACC    MCC    f1_score   AUC_score
    CPDBAS     318   346   0.75   0.8    0.77   0.55   0.76       0.87
    DBPCAN     76    95    0.88   0.93   0.91   0.81   0.89       0.97
    EPAFHM     300   287   0.75   0.82   0.78   0.57   0.78       0.86
    FDAMDD     361   441   0.81   0.81   0.81   0.62   0.8        0.9
    NCTRER     130   92    0.88   0.84   0.86   0.71   0.88       0.93


    Table 2. Summary of five datasets

    Fingerprint type   Data set   P     N     SEN    SPE    ACC    MCC    f1_score   AUC_score
    Daylight           CPDBAS     319   346   0.72   0.82   0.77   0.54   0.75       0.83
    Daylight           DBPCAN     77    95    0.84   0.88   0.87   0.73   0.85       0.94
    Daylight           EPAFHM     301   287   0.69   0.68   0.69   0.37   0.69       0.76
    Daylight           FDAMDD     361   441   0.79   0.82   0.8    0.61   0.78       0.89
    Daylight           NCTRER     130   92    0.84   0.75   0.8    0.59   0.83       0.87
    ECFP4              CPDBAS     319   346   0.73   0.84   0.79   0.58   0.77       0.86
    ECFP4              DBPCAN     77    95    0.87   0.87   0.87   0.74   0.86       0.96
    ECFP4              EPAFHM     301   287   0.72   0.7    0.71   0.42   0.72       0.78
    ECFP4              FDAMDD     361   441   0.82   0.83   0.83   0.66   0.81       0.89
    ECFP4              NCTRER     130   92    0.88   0.87   0.88   0.75   0.89       0.94
    Estate             CPDBAS     319   346   0.73   0.75   0.74   0.48   0.73       0.81
    Estate             DBPCAN     77    95    0.83   0.94   0.89   0.78   0.87       0.94
    Estate             EPAFHM     301   287   0.68   0.7    0.69   0.38   0.69       0.74
    Estate             FDAMDD     361   441   0.71   0.8    0.76   0.52   0.73       0.83
    Estate             NCTRER     130   92    0.86   0.7    0.79   0.57   0.83       0.82
    FP2                CPDBAS     319   346   0.73   0.8    0.77   0.53   0.75       0.83
    FP2                DBPCAN     77    95    0.88   0.94   0.91   0.82   0.9        0.96
    FP2                EPAFHM     301   287   0.71   0.68   0.69   0.39   0.7        0.78
    FP2                FDAMDD     361   442   0.78   0.82   0.8    0.6    0.78       0.89
    FP2                NCTRER     130   92    0.88   0.78   0.84   0.66   0.86       0.9
    MACCS              CPDBAS     319   346   0.76   0.83   0.8    0.6    0.78       0.87
    MACCS              DBPCAN     77    95    0.9    0.93   0.91   0.82   0.9        0.97
    MACCS              EPAFHM     301   287   0.72   0.77   0.75   0.5    0.75       0.8
    MACCS              FDAMDD     361   441   0.78   0.82   0.8    0.6    0.78       0.88
    MACCS              NCTRER     130   92    0.83   0.83   0.83   0.65   0.85       0.9


    Table 3. Summary of last four APPs

    Data set   P       N       SEN    SPE    ACC    AUC_score
    APP2       849     1619    0.94   0.69   0.85   0.9
    APP3       490     591     0.75   0.88   0.82   0.89
    APP4       2922    2922    0.85   0.86   0.86   0.94
    APP5       11191   11191   0.91   0.92   0.92   0.97
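
    The column abbreviations in the tables are not defined in this tutorial. Assuming they follow the usual binary-classification definitions, with P and N the numbers of positive and negative samples, SEN the sensitivity, SPE the specificity, ACC the accuracy and MCC the Matthews correlation coefficient, they can be computed from confusion-matrix counts as in the sketch below. The function name and the example counts are hypothetical.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return SEN, SPE, ACC, MCC and F1 from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes are present and
    both are predicted at least once).
    """
    sen = tp / (tp + fn)                      # true positive rate
    spe = tn / (tn + fp)                      # true negative rate
    acc = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sen / (precision + sen)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return {"SEN": sen, "SPE": spe, "ACC": acc, "MCC": mcc, "f1_score": f1}


# Made-up counts: 10 positives (8 found) and 10 negatives (7 found).
scores = binary_metrics(tp=8, fn=2, tn=7, fp=3)
print({name: round(value, 2) for name, value in scores.items()})
```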

  7. References.

    1. Cao, D.-S., J. Dong, et al. (2015). "In silico toxicity prediction of chemicals from EPA toxicity database by kernel fusion-based support vector machines." Chemometrics and Intelligent Laboratory Systems 146: 494-502.

    2. Jia, P., Z. Qian, et al. (2007). "Prediction of subcellular protein localization based on functional domain composition." Biochemical and Biophysical Research Communications 357(2): 366-370.

    3. Liu, B., F. Liu, et al. (2015). "repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects." Bioinformatics 31(8): 1307-1309.

    4. Wu, T., J. Wang, et al. (2006). "NPInter: the noncoding RNAs and protein related biomacromolecules interaction database." Nucleic Acids Research 34(suppl 1): D150-D152.

    5. Yamanishi, Y., M. Araki, et al. (2008). "Prediction of drug–target interaction networks from the integration of chemical and genomic spaces." Bioinformatics 24(13): i232-i240.