This section compiles stand-alone codes and webservers implementing methods developed in my group, often in collaboration with other groups (see the associated papers for further details). These tools are organised by intended use, which is succinctly explained in terms of the required input and output data.
Structure-based binding affinity prediction
RF-Score is a first-in-class machine-learning scoring function for structure-based binding affinity prediction of protein-ligand complexes. This first version can be downloaded from this link, which contains the RF-Score v1 codes as well as the calculated descriptors (also known as features) and the measured binding affinity of each training/test complex. To learn how to use RF-Score, please read the instructions and reproduce the RF-Score performance on the PDBbind benchmark. The concept was explained in this paper.
Input: a set of X-ray crystal structures of different protein-ligand complexes (provided). Output: predicted binding affinity of each protein-ligand complex.
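Reproducing the benchmark performance comes down to comparing predicted and measured binding affinities on the test complexes. The following is a minimal sketch of that comparison, assuming Pearson correlation and root-mean-square error as the metrics (an assumption made here; the instructions released with the code define the actual protocol). The function names and the toy values are hypothetical.

```python
import math

def pearson_r(predicted, measured):
    """Pearson correlation between predicted and measured affinities."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    return cov / math.sqrt(var_p * var_m)

def rmse(predicted, measured):
    """Root-mean-square error between predicted and measured affinities."""
    n = len(predicted)
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / n)

# Toy values, made up for illustration only (not benchmark data).
measured = [4.2, 6.5, 7.1, 5.0, 8.3]
predicted = [4.8, 6.0, 6.9, 5.6, 7.4]
print(f"Pearson r = {pearson_r(predicted, measured):.3f}")
print(f"RMSE      = {rmse(predicted, measured):.3f}")
```

Both lists must hold one value per test complex, in the same order.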
We designed an improved version of RF-Score (RF-Score v2). The code to generate v2 features can be downloaded from here and it is maintained by Dr Adrian Schreyer (firstname.lastname@example.org). The associated paper can be found here.
Input: a set of X-ray crystal structures of different protein-ligand complexes (the same set provided for v1 can be used here as a first example). Output: predicted binding affinity of each protein-ligand complex. NB: the code calculates v2 features for each protein-ligand complex; the R script from RF-Score v1 can then be combined with these features to train and test RF-Score v2.
In a subsequent paper, we studied how to improve the predictive ability of Vina by using Random Forest as the learning algorithm, merging Vina features with RF-Score v1 features and exploiting a very large number of crystal structures. The resulting scoring function is available here.
Input: a set of X-ray crystal structures of different protein-ligand complexes (one example with expected results is provided). Output: predicted binding affinity of each protein-ligand complex. NB: the code runs the already-built scoring function on user-supplied complexes.
We also discovered that training on redocked poses rather than crystal structures is an effective way to improve prediction of binding affinity on redocked poses (this paper investigates this issue). The resulting scoring function is available here.
Input: a set of redocked X-ray crystal structures of different protein-ligand complexes (RF-Score v4 is optimised for docking poses from AutoDock Vina; one example with expected results is provided). Output: predicted binding affinity of each protein-ligand complex. NB: the code runs the already-built scoring function on user-supplied complexes.
Structure-based virtual screening
iStar is a user-friendly large-scale docking webserver for prospective virtual screening accessible through this link. Further information can be found in this paper. The webserver runs iDock, an optimised version of Vina. It also permits the use of RF-Score v1 to rank the docking poses by predicted binding affinity, but this is not optimal (instead use Vina > RF-Score-VS as described below).
Input: X-ray crystal structure of the protein, definition of the binding pocket and selection of 23 million purchasable compounds. Output: the top 1000 compounds docked to the protein according to Vina or RF-Score v1.
To excel at virtual screening, machine-learning scoring functions have to be trained on negative data. Therefore, we trained a Random Forest on over 900,000 docked molecules across 102 targets. Most of these molecules are docked decoys (i.e. synthetic data generated from molecules assumed inactive). This led to RF-Score-VS, which is fully described in this paper. We provide full data sets to facilitate further research in this area (http://github.com/oddt/rfscorevs) as well as ready-to-use RF-Score-VS (http://github.com/oddt/rfscorevs_binary).
Input: either the processed data to build the scoring function or the Vina-generated docked poses to score. Output: the poses ranked by likelihood of being a true binder of the considered target. NB: As always, follow the information accompanying the released codes.
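The output of a virtual screen is a ranking. As a minimal sketch of that last step, the snippet below keeps the best-scoring pose of each compound and sorts the compounds by that score. It assumes that a higher score means a more likely binder, and the function name, compound identifiers and scores are hypothetical; it does not reproduce the released RF-Score-VS code.

```python
def rank_compounds(pose_scores, top_n=None):
    """Rank compounds by the best score among their docked poses.

    pose_scores: iterable of (compound_id, score) pairs, one per docked pose.
    Returns a list of (compound_id, best_score), highest score first.
    """
    best = {}
    for compound_id, score in pose_scores:
        # Keep only the highest-scoring pose of each compound.
        if compound_id not in best or score > best[compound_id]:
            best[compound_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]

# Toy scores, made up for illustration only.
poses = [("cpd_A", 6.1), ("cpd_B", 7.4), ("cpd_A", 6.8), ("cpd_C", 5.2)]
for compound_id, score in rank_compounds(poses, top_n=2):
    print(compound_id, score)
```

If the scoring function uses the opposite convention (lower is better, as with docking energies), drop `reverse=True` and compare with `<` instead of `>`.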
Ligand-based virtual screening
Ligand-based virtual screening methods aim at identifying molecules with a similar activity profile across phenotypic and macromolecular targets to that of a query molecule used as search template. Virtual screening using 3D similarity methods has the advantage of biasing this search toward active molecules with innovative chemical scaffolds, which are highly sought after in drug design to provide novel leads with improved properties over the query molecule (e.g. patentable, of lower toxicity or increased potency).
Ultrafast Shape Recognition (USR) has demonstrated excellent performance in the discovery of molecules with previously-unknown phenotypic or target activity, with retrospective studies suggesting that its pharmacophoric extension (USRCAT) should obtain even better hit rates once it is used prospectively. We recently introduced USR-VS (http://usr.marseille.inserm.fr/), the first web server using these two validated ligand-based 3D methods for large-scale prospective virtual screening. In about 2 seconds, 93.9 million 3D conformers, expanded from 23.1 million purchasable molecules, are screened and the 100 molecules most similar to the query in terms of 3D shape and pharmacophoric properties are shown. USR-VS also provides interactive visualization of the similarity of the query molecule against the hit molecules, as well as vendor information to purchase selected hits for experimental testing. The development and functionality of USR-VS are fully explained in this paper.
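For readers unfamiliar with USR, the following minimal sketch illustrates the idea as it is usually described in the USR literature: a conformer is encoded by 12 numbers, namely three moments of the atomic distance distributions to four reference points (the centroid, the atom closest to it, the atom farthest from it, and the atom farthest from that one), and two conformers are compared with an inverse Manhattan distance. This is not the USR-VS implementation. The function names are hypothetical, and the choice of moments (mean, standard deviation, cube root of the third central moment) is one common variant; implementations differ on this point.

```python
import math

def _cbrt(x):
    """Real cube root that also works for negative values."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)

def _moments(distances):
    """Mean, standard deviation and cube root of the third central moment."""
    n = len(distances)
    mean = sum(distances) / n
    std = math.sqrt(sum((d - mean) ** 2 for d in distances) / n)
    skew = _cbrt(sum((d - mean) ** 3 for d in distances) / n)
    return [mean, std, skew]

def usr_descriptors(coords):
    """12 shape descriptors from a list of (x, y, z) atomic coordinates."""
    n = len(coords)
    ctd = tuple(sum(c[k] for c in coords) / n for k in range(3))  # centroid
    d_ctd = [math.dist(c, ctd) for c in coords]
    cst = coords[d_ctd.index(min(d_ctd))]   # atom closest to the centroid
    fct = coords[d_ctd.index(max(d_ctd))]   # atom farthest from the centroid
    d_fct = [math.dist(c, fct) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]   # atom farthest from fct
    d_cst = [math.dist(c, cst) for c in coords]
    d_ftf = [math.dist(c, ftf) for c in coords]
    descriptors = []
    for distances in (d_ctd, d_cst, d_fct, d_ftf):
        descriptors.extend(_moments(distances))
    return descriptors

def usr_similarity(desc_a, desc_b):
    """Inverse of (1 + mean absolute descriptor difference); 1.0 means identical."""
    diff = sum(abs(a - b) for a, b in zip(desc_a, desc_b)) / len(desc_a)
    return 1.0 / (1.0 + diff)

# Toy coordinates, made up for illustration only (not a real molecule).
query = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.1, 1.2, 0.0),
         (3.4, 1.0, 0.8), (0.2, -1.1, 0.5)]
other = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
         (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(usr_similarity(usr_descriptors(query), usr_descriptors(other)))
```

Because the descriptors depend only on interatomic distances, no alignment of the two conformers is needed, which is what makes screening millions of conformers in seconds feasible.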
Predicting the whole-cell activity of a molecule
CCLP (Cancer Cell Line Profiler) is a webserver for the prediction of the whole-cell activities of a user-supplied molecule across the NCI60 panel. CCLP uses a multi-task Random Forest model trained on 941,831 activities integrating chemical structure data from 3,300 molecules and multi-omics data from 59 cancer cell lines. In addition, CCLP implements conformal prediction to provide individual prediction errors at several confidence levels. CCLP computes compound descriptors for a set of input molecules and predicts their activity across the NCI60 panel (thus, it can be used to position target-active molecules on one of the NCI60 cancer types). The output of running CCLP consists of one barplot per input compound displaying the predicted activities and errors across the NCI60 panel, as well as a text file reporting the predicted activities and errors in prediction. CCLP is freely available online at https://cclp.marseille.inserm.fr/NCI60/ and it is described in this paper.
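To illustrate how conformal prediction turns point predictions into errors at a chosen confidence level, here is a minimal sketch of split (inductive) conformal regression. It is not the CCLP implementation: the function names and toy residuals are hypothetical, and this basic variant gives every prediction the same interval width, whereas normalised variants scale the width per prediction.

```python
import math

def conformal_half_width(calibration_residuals, confidence):
    """Half-width of the prediction interval at the given confidence level.

    calibration_residuals: measured minus predicted activity on a held-out
    calibration set that was not used to train the model.
    """
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Rank of the calibration score that guarantees the requested coverage.
    rank = math.ceil((n + 1) * confidence)
    if rank > n:
        # Too few calibration points for this confidence level.
        return math.inf
    return scores[rank - 1]

def prediction_interval(prediction, calibration_residuals, confidence):
    """Symmetric interval around a point prediction."""
    half = conformal_half_width(calibration_residuals, confidence)
    return prediction - half, prediction + half

# Toy residuals, made up for illustration only.
residuals = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8, 0.9]
for confidence in (0.8, 0.9):
    print(confidence, prediction_interval(6.0, residuals, confidence))
```

Raising the confidence level widens the interval, and a larger calibration set allows higher confidence levels to be reported.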
Predicting the molecular targets of a molecule
Computational methods for Target Fishing (TF), also known as Target Prediction or Polypharmacology Prediction, can be used to discover new targets for small-molecule drugs. This paper made the following contributions: a) showed that target-centric TF methods are inherently limited by the number of targets that they can possibly predict (this number is by construction much larger in ligand-centric techniques), b) proposed a new benchmark to validate TF methods, which is particularly suited to analysing how predictive performance varies with the query molecule, c) found that an approved drug currently has an average of eight known targets, and d) demonstrated that the targets of approved drugs are generally harder to predict. The benchmark and a simple target prediction method to use as a performance baseline can be downloaded from here.
Input: the SMILES of the molecule whose targets are to be predicted (i.e. the query molecule). Output: the list of predicted targets for that molecule. NB: it is easy to build upon the released code (e.g. using other molecular similarity techniques) and I hope that this will facilitate further research in this area.
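As an illustration of the ligand-centric idea, the sketch below predicts the targets of a query molecule from the known targets of its most similar reference ligands. It is not the released baseline: the function names, the use of Tanimoto similarity, the top_k parameter and the toy data are all assumptions. Fingerprints are represented as sets of on-bits that would in practice be computed from the SMILES with a cheminformatics toolkit (that step is omitted here).

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def predict_targets(query_fp, reference_ligands, top_k=3):
    """Predict targets from the top_k reference ligands most similar to the query.

    reference_ligands: list of (fingerprint_set, set_of_known_targets).
    Returns (target, score) pairs, where the score is the highest similarity
    of any neighbour annotated with that target, best first.
    """
    neighbours = sorted(
        reference_ligands,
        key=lambda ligand: tanimoto(query_fp, ligand[0]),
        reverse=True,
    )[:top_k]
    best = {}
    for fingerprint, targets in neighbours:
        similarity = tanimoto(query_fp, fingerprint)
        for target in targets:
            best[target] = max(best.get(target, 0.0), similarity)
    # Sort by decreasing score, then by name so that ties are deterministic.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))

# Toy fingerprints and target labels, made up for illustration only.
references = [
    ({1, 2, 3, 5}, {"T1", "T2"}),
    ({1, 2, 3, 4, 6}, {"T3"}),
    ({7, 8, 9}, {"T4"}),
]
for target, score in predict_targets({1, 2, 3, 4}, references, top_k=2):
    print(target, round(score, 2))
```

Swapping `tanimoto` for another molecular similarity measure is the kind of extension mentioned above; the rest of the function stays unchanged.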