[...] with four or more cysteine residues from the Antimicrobial Peptides Database (APD) [35]. This set was manually curated, keeping only the sequences annotated with activity against at least one of bacteria, fungi or viruses. In addition, incomplete sequences were removed. The positive data set (PS) comprised 385 sequences ranging in size from 16 to 90 amino acid residues.

The negative data set (NS) was composed of a subset of the Protein Data Bank (PDB), whereas in our previous work it was composed of random proteins predicted as transmembrane [20]. Initially, the protein sequences [...]

Figure 1. Principal component analysis of sequence descriptors for cysteine-stabilized peptides. The components are indicated by arrows: the larger the arrow, the greater the component's contribution to the set’s variance. (A) The disposition of the nine sequence descriptors in the peptide space; (B) the final ensemble of descriptors, after the hydrophobic moment, the index of β-sheet formation, the ratio between charged and hydrophobic residues and the α-helix propensity were ruled out. doi:10.1371/journal.pone.0051444.g

CS-AMPPred: The Cysteine-Stabilized AMPs Predictor

Figure 2. Distribution of sequence descriptor values. The left box in each panel corresponds to the AMPs. All descriptors differ significantly from the non-antimicrobial data set at a critical value of 0.05. The observed p-values are as follows: charge (<2.2e-16), hydrophobicity (2.169e-06), flexibility (<2.2e-16), index of α-helix formation (<2.2e-16) and index of loop formation (2.908e-10). doi:10.1371/journal.pone.0051444.g

[...] subsequently the descriptors with redundant behavior or with little influence on variance were removed. Then, a two-sided Wilcoxon-Mann-Whitney non-parametric test was applied to verify the differences between the sequence descriptors in the PS and NS sets, with a critical value of 0.05.
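As an illustration of the two-sided Wilcoxon-Mann-Whitney comparison described above, the following is a minimal Python sketch using the normal approximation with tie and continuity corrections. It is not the authors' code (the original analyses were run in R); the function names and the toy descriptor values are hypothetical and chosen only for the example.

```python
import math


def rank_with_ties(values):
    """Return 1-based average ranks and the size of each tie group."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        # Extend j over all entries equal to the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def mann_whitney_two_sided(x, y):
    """Return (U statistic of x, two-sided p-value) via normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, tie_sizes = rank_with_ties(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term)
    if var_u == 0:
        return u1, 1.0  # all values identical: no evidence of a difference
    # Continuity correction of 0.5, clipped at zero.
    z = max(abs(u1 - mean_u) - 0.5, 0.0) / math.sqrt(var_u)
    p_value = math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))
    return u1, p_value


if __name__ == "__main__":
    # Toy values for one descriptor (hypothetical, NOT data from the paper).
    ps_charge = [4.0, 5.5, 6.0, 3.5, 7.0, 5.0]
    ns_charge = [0.5, -1.0, 1.5, 2.0, 0.0, -0.5]
    u, p = mann_whitney_two_sided(ps_charge, ns_charge)
    print("U =", u, "p =", p, "significant at 0.05:", p < 0.05)
```

The normal approximation is adequate for samples as large as PS and NS; for very small samples an exact test would be preferable.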
The statistical analyses were performed with the R package for statistical computing (http://www.r-project.org).

Support Vector Machine’s Training and Validation

Three SVM models were developed through SVM Light [41], using the linear, polynomial and radial kernels. The models were trained on the training set. An overview of each model’s accuracy was estimated through 5-fold cross validation, taking into account only the training data set. Thereafter, the models were challenged against the blind data set, where the following parameters were measured:

$$\text{Sensitivity} = \frac{TP}{TP + FN} \times 100 \tag{1}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \times 100 \tag{2}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP} \times 100 \tag{3}$$

$$\text{PPV} = \frac{TP}{TP + FP} \times 100 \tag{4}$$

$$\text{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} \tag{5}$$

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.

Figure 3. ROC curves for the CS-AMPPred models against the blind data set (BS1). doi:10.1371/journal.pone.0051444.g

Table 1. Evaluation of CS-AMPPred models against the individual cysteine-stabilized AMP classes and also PDB sequences which were not used in the data sets.

Model             Linear   Polynomial   Radial
α-defensins¹      93.33    97.78        97.78
β-defensins¹      96.83    95.24        96.83
CSαβ defensins¹   81.36    77.12        77.12
Cyclotides¹       70.34    81.36        83.05
Undefined¹        84.13    79.37        80.95
PDB#              80.65    82.55        81.

¹Antimicrobial peptide classes; values computed through equation 1 (Sensitivity).
#Non-antimicrobial peptides; values computed through equation 2 (Specificity), using the 1364 sequences from PDB which were not included in NS.
doi:10.1371/journal.pone.0051444.t
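The evaluation measures listed above (sensitivity, specificity, accuracy, PPV and MCC) can be sketched in Python as follows. This is an illustrative sketch only: the function name `evaluate` and the confusion-matrix counts in the usage example are hypothetical, and the MCC denominator assumes the standard definition of the Matthews correlation coefficient.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Compute the five measures from confusion-matrix counts.

    Percent-scaled measures follow the text (multiplied by 100);
    MCC is left on its usual -1..1 scale.
    """
    def percent(numerator, denominator):
        # Guard against empty classes instead of raising ZeroDivisionError.
        return 100.0 * numerator / denominator if denominator else float("nan")

    # Standard MCC denominator (assumed; product of the four marginal totals).
    mcc_denominator = math.sqrt(
        (tp + fn) * (tp + fp) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": percent(tp, tp + fn),
        "specificity": percent(tn, tn + fp),
        "accuracy": percent(tp + tn, tp + tn + fn + fp),
        "ppv": percent(tp, tp + fp),
        "mcc": ((tp * tn) - (fp * fn)) / mcc_denominator
        if mcc_denominator
        else float("nan"),
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration; not results from the paper.
    for name, value in evaluate(tp=90, tn=80, fp=20, fn=10).items():
        print(f"{name}: {value:.2f}")
```

Returning `nan` for an empty class is a design choice of this sketch; it keeps a batch evaluation running when, for example, a peptide class contributes no negatives.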