Precision and recall oncology: combining multiple gene mutations for improved identification of drug-sensitive tumours

Clinical practice is progressively incorporating genomic markers to personalize the treatment of cancer patients or to decide their inclusion in clinical trials. Unfortunately, only a small proportion of responsive patients can currently be identified before administering the treatment. There is therefore a need for new methods that better discriminate between sensitive and resistant tumours, and there are now sufficiently large in vitro pharmacogenomics data sets to carry out such a study.

Our paper presents a first-of-its-kind large-scale comparison of the performance of single-gene markers and multi-gene machine-learning markers across 127 drugs in the common clinical scenario where only genomic data are available. We are also, to our knowledge, the first to test genomic markers on a truly independent data set. From this rigorous validation, we conclude that combining multiple gene mutations via machine learning discriminates better than single-gene markers for about half of these drugs (e.g. Temsirolimus, 17-AAG or Methotrexate).
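To make the idea of a multi-gene marker concrete, here is a minimal sketch (not the actual pipeline from the paper): a logistic-regression classifier fitted by gradient descent on a binary mutation matrix, where drug sensitivity is driven by a combination of genes that no single-gene marker could capture on its own. The cohort, gene columns and sensitivity rule are entirely made up for illustration.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit a logistic-regression marker by full-batch gradient descent.
    X: binary mutation matrix (tumours x genes); y: 1 = drug-sensitive."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(sensitive)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

# Toy cohort: columns are the mutation status of three hypothetical genes.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3)).astype(float)
# Invented ground truth: sensitivity arises when gene 0 OR gene 1 is mutated,
# so no single gene fully explains the response on its own.
y = ((X[:, 0] + X[:, 1]) >= 1).astype(float)

w, b = fit_logistic(X, y)
pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
accuracy = np.mean(pred == y)
```

The point of the sketch is only that the learned weights combine several mutations into one decision rule, which is what a single-gene marker cannot do.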

In the light of these results, we discuss why clinical personalized cancer treatment only manages to help a fraction of the patients that single-gene markers were expected to identify, and how multi-gene markers could greatly alleviate this in some cases without the need to acquire further data. To this end, we stress the importance of not only the precision of a marker but also its recall (sensitivity), which we have found to be greatly improved by multi-gene machine-learning markers. As genomic markers grow more popular in clinical settings, more attention needs to be paid to the recall of the predictive models used to identify responsive tumours, as part of a precision and recall oncology approach enabled by machine-learning modelling.
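The precision/recall distinction above can be stated concretely: precision is the fraction of tumours flagged by a marker that truly respond, while recall is the fraction of truly responsive tumours that the marker finds. A minimal illustration with invented labels:

```python
# 1 = tumour responds to the drug (true) / is flagged by the marker (calls).
true_response = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
marker_calls  = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]

tp = sum(t == 1 and m == 1 for t, m in zip(true_response, marker_calls))
fp = sum(t == 0 and m == 1 for t, m in zip(true_response, marker_calls))
fn = sum(t == 1 and m == 0 for t, m in zip(true_response, marker_calls))

precision = tp / (tp + fp)  # of the flagged tumours, how many respond
recall = tp / (tp + fn)     # of the responsive tumours, how many are flagged
```

Here the marker is fairly precise (precision 0.75) but misses two of the five responsive tumours (recall 0.6), which is exactly the failure mode a low-recall marker exhibits in the clinic.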

This study is freely available at


USR-VS: a web server for large-scale prospective virtual screening using ultrafast shape recognition techniques

Only a few small-molecule ligands are known for your target? Need a validated tool to find new binders with different chemical scaffolds for this target? In that case, you definitely want to try this user-friendly web server:

An example of its use is described here:

Further details in the associated paper:

Abstract: Ligand-based Virtual Screening (VS) methods aim at identifying molecules whose activity profile across phenotypic and macromolecular targets is similar to that of a query molecule used as search template. VS using 3D similarity methods has the advantage of biasing this search towards active molecules with innovative chemical scaffolds, which are highly sought after in drug design to provide novel leads with improved properties over the query molecule (e.g. patentable, of lower toxicity or of increased potency). Ultrafast Shape Recognition (USR) has demonstrated excellent performance in the discovery of molecules with previously unknown phenotypic or target activity, with retrospective studies suggesting that its pharmacophoric extension (USRCAT) should obtain even better hit rates once used prospectively. Here we present USR-VS, the first web server using these two validated ligand-based 3D methods for large-scale prospective VS. In about 2 seconds, 93.9 million 3D conformers, expanded from 23.1 million purchasable molecules, are screened and the 100 molecules most similar to the query in terms of 3D shape and pharmacophoric properties are shown. USR-VS also provides interactive visualization of the similarity of the query molecule against the hit molecules, as well as vendor information to purchase selected hits so that they can be experimentally tested.
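For readers curious how USR scores shape similarity so quickly, here is a minimal sketch of the core idea (my own simplified reconstruction, not the server's implementation): each conformer is reduced to 12 numbers, the first three moments of the atomic distance distributions from four reference locations, and two conformers are compared by a simple function of the difference between their descriptors, with no alignment step at all.

```python
import numpy as np

def usr_descriptor(coords):
    """12-number USR shape descriptor of one conformer.
    coords: (n_atoms, 3) array of atomic positions."""
    ctd = coords.mean(axis=0)                 # molecular centroid
    d_ctd = np.linalg.norm(coords - ctd, axis=1)
    cst = coords[d_ctd.argmin()]              # closest atom to centroid
    fct = coords[d_ctd.argmax()]              # farthest atom from centroid
    d_fct = np.linalg.norm(coords - fct, axis=1)
    ftf = coords[d_fct.argmax()]              # farthest atom from fct
    desc = []
    for ref in (ctd, cst, fct, ftf):
        d = np.linalg.norm(coords - ref, axis=1)
        mu = d.mean()
        # First three moments of the distance distribution to this reference.
        desc += [mu, d.std(), np.cbrt(np.mean((d - mu) ** 3))]
    return np.array(desc)

def usr_similarity(a, b):
    """Similarity in (0, 1]; 1 means identical descriptors."""
    return 1.0 / (1.0 + np.abs(a - b).mean())

# Toy "molecule": random atomic coordinates for illustration only.
rng = np.random.default_rng(1)
mol = rng.normal(size=(20, 3))
self_sim = usr_similarity(usr_descriptor(mol), usr_descriptor(mol))
```

Because the descriptor is a fixed-length vector and the comparison is alignment-free, screening millions of pre-computed conformers reduces to cheap vector arithmetic, which is what makes the roughly 2-second search time plausible.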

Machine-learning scoring functions to improve structure-based binding affinity prediction and virtual screening

This review provides an overview of the relatively new topic of machine-learning scoring functions for docking. A PDF is available at

This is the abstract:

Docking tools to predict whether and how a small molecule binds to a target can be applied if a structural model of that target is available. The reliability of docking depends, however, on the accuracy of the adopted scoring function (SF). Despite intense research over the years, improving the accuracy of SFs for structure-based binding affinity prediction or virtual screening has proven to be a challenging task for any class of method. New SFs based on modern machine-learning regression models, which do not impose a predetermined functional form and are thus able to effectively exploit much larger amounts of experimental data, have recently been introduced. These machine-learning SFs have been shown to outperform a wide range of classical SFs at both binding affinity prediction and virtual screening. The emerging picture from these studies is that the classical approach of using linear regression with a small number of expert-selected structural features can be strongly improved by a machine-learning approach based on nonlinear regression allied with comprehensive data-driven feature selection. Furthermore, the performance of classical SFs does not improve with larger training data sets, and hence this performance gap is expected to widen as more training data become available in the future. Other topics covered in this review include predicting the reliability of an SF on a particular target class, generating synthetic data to improve predictive performance, and modeling guidelines for SF development.
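The core argument, that a regressor free of a predetermined functional form can capture structure a fixed linear form cannot, is easy to illustrate on synthetic data. In this sketch, the features and the affinity relationship are invented; a simple k-nearest-neighbours regressor stands in for the machine-learning SF and a least-squares line for the classical one:

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=5):
    """k-nearest-neighbours regression: average the k closest training labels."""
    preds = []
    for xq in x_query:
        idx = np.argsort(np.abs(x_train - xq))[:k]
        preds.append(y_train[idx].mean())
    return np.array(preds)

rng = np.random.default_rng(0)
# Toy descriptor (think of it as a single contact count) with a strongly
# nonlinear relationship to binding affinity, plus experimental noise.
x = rng.uniform(0.0, 5.0, size=400)
y = np.sin(1.5 * x) + 0.1 * rng.normal(size=x.size)
x_train, y_train = x[:300], y[:300]
x_test, y_test = x[300:], y[300:]

# Classical-style SF: linear least-squares fit with a fixed functional form.
slope, intercept = np.polyfit(x_train, y_train, 1)
rmse_linear = np.sqrt(np.mean((slope * x_test + intercept - y_test) ** 2))

# Machine-learning-style SF: nonlinear, data-driven kNN regression.
rmse_knn = np.sqrt(np.mean((knn_predict(x_train, y_train, x_test) - y_test) ** 2))
```

The linear fit hits a hard accuracy ceiling no matter how much data it sees, whereas the nonlinear model keeps improving as training points fill in the curve, which mirrors the widening performance gap described above.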


How Reliable Are Ligand-Centric Methods for Target Fishing?

We recently published a paper on computational methods for molecular target prediction. A PDF of this article can be downloaded at

Computational methods for Target Fishing (TF), also known as Target Prediction or Polypharmacology Prediction, can be used to discover new targets for small-molecule drugs. This may result in repositioning the drug in a new indication or in improving our current understanding of its efficacy and side effects. While there is a substantial body of research on TF methods, there is still a need to improve their validation, which is often limited to a small part of the available targets and not easily interpretable by the user. Here we discuss how target-centric TF methods are inherently limited by the number of targets that they can possibly predict (this number is by construction much larger in ligand-centric techniques). We also propose a new benchmark to validate TF methods, which is particularly suited to analysing how predictive performance varies with the query molecule. On average over approved drugs, we estimate that only five predicted targets have to be tested to find two true targets with submicromolar potency (although a strong variability in performance is observed). In addition, we find that an approved drug currently has an average of eight known targets, which reinforces the notion that polypharmacology is a common and pronounced phenomenon. Furthermore, with the assistance of a control group of randomly selected molecules, we show that the targets of approved drugs are generally harder to predict. The benchmark and a simple target prediction method to use as a performance baseline are available at
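Ligand-centric TF methods typically rank candidate targets by the chemical similarity between the query molecule and the known ligands of each target. A minimal sketch of that ranking step, using the Tanimoto coefficient on fingerprints stored as sets of on-bits (the targets, ligands and bit sets here are entirely hypothetical):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints stored as sets of on-bits."""
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

# Hypothetical fingerprints of known ligands, grouped by target.
ligands_by_target = {
    "TARGET_A": [{1, 2, 3, 7}, {2, 3, 7, 9}],
    "TARGET_B": [{10, 11, 12}, {11, 12, 13, 14}],
}
query = {2, 3, 7, 8}  # fingerprint of the query molecule

# Score each target by its most similar known ligand, then rank targets.
scores = {t: max(tanimoto(query, fp) for fp in fps)
          for t, fps in ligands_by_target.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because any target with at least one known ligand can be scored this way, the set of predictable targets is by construction much larger than in target-centric methods, which need a model (or structure) per target.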

The Wayback Machine

I was just looking at one of my favourite websites, the web archive, which is helpful for diving into the past. Or at least that part of the past that can be captured digitally (web, audio, video, …).

In the web section, the Wayback Machine, one can check how a particular website looked many years ago. Check, for instance, the modern look of the EBI website today. Now have a look at this website on 6 June 1997. What a change, right? Most links are still active, so you can inspect the services available at that time, who was around, etc. If you wish to explore the website at some other moment in time, there is a navigation bar at the top which takes you to the other available captures.

Warning: this is a serious time sink!

Searching for two postdocs for my research lab in Marseille

I have just become a group leader at the CRCM in Marseille. Thus, I am currently searching for postdocs to work in areas related to bioinformatics and drug discovery informatics.

The first post is to work on modelling cancer pharmacogenomics:

The second post will investigate new methods for drug polypharmacology prediction:

Both positions will be for two years in the first instance. The deadline for applications is Friday 17 October 2014.