Precision and recall oncology: combining multiple gene mutations for improved identification of drug-sensitive tumours

Clinical treatment is progressively incorporating genomic markers to personalize the treatment of cancer patients or decide their inclusion in clinical trials. Unfortunately, only a small fraction of responsive patients can currently be identified before administering the treatment. Therefore, there is a need for new methods able to better discriminate between sensitive and resistant tumours. There are now sufficiently large in vitro pharmacogenomics data sets to carry out such a study.

Our paper presents a first-in-kind large-scale comparison of the performance of single-gene markers and multi-gene machine learning markers across 127 drugs in the common clinical scenario where only genomic data is available. We are also the first, to our knowledge, to test genomic markers on a truly independent data set. From the results of this rigorous validation, we conclude that combining multiple gene mutations via machine learning results in better discrimination than that provided by single-gene markers in about half of these drugs (e.g. Temsirolimus, 17-AAG or Methotrexate).

In the light of these results, we discuss why clinical personalized cancer treatment only manages to treat a fraction of the patients that were expected to be helped with single-gene markers, and how this could be greatly alleviated by multi-gene markers in some cases without the need to acquire further data. To this end, we stress the importance of not only the precision of a marker but also its recall (sensitivity), which we have found to be greatly improved by multi-gene machine-learning markers. As genomic markers continue to grow more popular in clinical settings, more attention needs to be paid to the recall of the predictive models that are used to identify responsive tumours, as part of a precision and recall oncology approach enabled by machine-learning modelling.
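To make the distinction between precision and recall concrete, here is a minimal Python sketch. The tumour labels and the two markers are made-up toy data, not results from the paper, and the function name `precision_recall` is just an illustrative choice.

```python
def precision_recall(predicted_sensitive, truly_sensitive):
    """Return (precision, recall) for two parallel lists of booleans."""
    pairs = list(zip(predicted_sensitive, truly_sensitive))
    tp = sum(1 for p, t in pairs if p and t)          # correctly flagged as sensitive
    fp = sum(1 for p, t in pairs if p and not t)      # flagged, but actually resistant
    fn = sum(1 for p, t in pairs if (not p) and t)    # sensitive tumours that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: 10 tumours, the first 5 are truly drug-sensitive.
truth = [True] * 5 + [False] * 5

# A hypothetical single-gene marker: very precise, but it finds only one responder.
single_gene = [True] + [False] * 9

# A hypothetical multi-gene marker: one false positive, but it finds four responders.
multi_gene = [True, True, True, True, False, True, False, False, False, False]

print(precision_recall(single_gene, truth))  # (1.0, 0.2)
print(precision_recall(multi_gene, truth))   # (0.8, 0.8)
```

In this toy case the single-gene marker never mislabels a resistant tumour, yet it misses four of the five responders. That is the situation in which looking at precision alone gives an overly optimistic picture of a marker.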

This study is freely available at

USR-VS: a web server for large-scale prospective virtual screening using ultrafast shape recognition techniques

Only a few small-molecule ligands are known for your target? Need a validated tool to find new binders with a different chemical scaffold for this target? In that case, you definitely want to try this user-friendly webserver:

An example of its use is described here:

Further details in the associated paper:

Abstract: Ligand-based Virtual Screening (VS) methods aim at identifying molecules with a similar activity profile across phenotypic and macromolecular targets to that of a query molecule used as search template. VS using 3D similarity methods has the advantage of biasing this search toward active molecules with innovative chemical scaffolds, which are highly sought after in drug design to provide novel leads with improved properties over the query molecule (e.g. patentable, of lower toxicity or increased potency). Ultrafast Shape Recognition (USR) has demonstrated excellent performance in the discovery of molecules with previously-unknown phenotypic or target activity, with retrospective studies suggesting that its pharmacophoric extension (USRCAT) should obtain even better hit rates once it is used prospectively. Here we present USR-VS, the first web server using these two validated ligand-based 3D methods for large-scale prospective VS. In about 2 seconds, 93.9 million 3D conformers, expanded from 23.1 million purchasable molecules, are screened and the 100 most similar molecules among them in terms of 3D shape and pharmacophoric properties are shown. USR-VS functionality also provides interactive visualization of the similarity of the query molecule against the hit molecules as well as vendor information to purchase selected hits in order to be experimentally tested.
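For readers curious about what a USR comparison involves, the following is a rough, standard-library-only Python sketch of the plain USR idea: describe each conformer by 12 numbers (three moments of the atomic distance distributions to four reference points) and compare descriptors with an inverse Manhattan distance. It is not the web server's code. The moment convention used here (mean, standard deviation, signed cube root of the third central moment) is one common choice and an assumption on my part, the coordinates are invented, and the pharmacophoric USRCAT extension is not covered.

```python
import math


def _moments(distances):
    """Mean, standard deviation and signed cube root of the third central moment."""
    n = len(distances)
    mean = sum(distances) / n
    second = sum((d - mean) ** 2 for d in distances) / n
    third = sum((d - mean) ** 3 for d in distances) / n
    return [mean, math.sqrt(second), math.copysign(abs(third) ** (1.0 / 3.0), third)]


def usr_descriptor(coords):
    """12-number shape descriptor of one conformer given as a list of (x, y, z) tuples."""
    n = len(coords)
    ctd = tuple(sum(atom[i] for atom in coords) / n for i in range(3))  # centroid
    d_ctd = [math.dist(atom, ctd) for atom in coords]
    cst = coords[d_ctd.index(min(d_ctd))]   # atom closest to the centroid
    fct = coords[d_ctd.index(max(d_ctd))]   # atom farthest from the centroid
    d_fct = [math.dist(atom, fct) for atom in coords]
    ftf = coords[d_fct.index(max(d_fct))]   # atom farthest from fct
    d_cst = [math.dist(atom, cst) for atom in coords]
    d_ftf = [math.dist(atom, ftf) for atom in coords]
    descriptor = []
    for distances in (d_ctd, d_cst, d_fct, d_ftf):
        descriptor.extend(_moments(distances))
    return descriptor


def usr_similarity(desc_a, desc_b):
    """Score in (0, 1]; equals 1 when the two descriptors are identical."""
    manhattan = sum(abs(a - b) for a, b in zip(desc_a, desc_b))
    return 1.0 / (1.0 + manhattan / len(desc_a))


# Invented coordinates, for illustration only.
query = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.1, 1.2, 0.0), (3.5, 1.3, 0.4), (4.0, 2.6, 0.9)]
shifted = [(x + 5.0, y - 2.0, z + 3.0) for x, y, z in query]   # same shape, translated
rod = [(3.0 * i, 0.0, 0.0) for i in range(5)]                  # a different, linear shape

print(usr_similarity(usr_descriptor(query), usr_descriptor(shifted)))
print(usr_similarity(usr_descriptor(query), usr_descriptor(rod)))
```

Because the descriptor is built only from interatomic distances, the translated copy scores essentially 1 without any alignment step, whereas the rod scores lower. Comparing two conformers costs just 12 subtractions, which is what makes screening very large conformer collections feasible.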

Machine-learning scoring functions to improve structure-based binding affinity prediction and virtual screening

This review surveys the relatively new topic of machine-learning scoring functions for docking. A PDF is available at

This is the abstract:

Docking tools to predict whether and how a small molecule binds to a target can be applied if a structural model of such target is available. The reliability of docking depends, however, on the accuracy of the adopted scoring function (SF). Despite intense research over the years, improving the accuracy of SFs for structure-based binding affinity prediction or virtual screening has proven to be a challenging task for any class of method. New SFs based on modern machine-learning regression models, which do not impose a predetermined functional form and thus are able to exploit effectively much larger amounts of experimental data, have recently been introduced. These machine-learning SFs have been shown to outperform a wide range of classical SFs at both binding affinity prediction and virtual screening. The emerging picture from these studies is that the classical approach of using linear regression with a small number of expert-selected structural features can be strongly improved by a machine-learning approach based on nonlinear regression allied with comprehensive data-driven feature selection. Furthermore, the performance of classical SFs does not grow with larger training datasets and hence this performance gap is expected to widen as more training data becomes available in the future. Other topics covered in this review include predicting the reliability of a SF on a particular target class, generating synthetic data to improve predictive performance and modeling guidelines for SF development.


How Reliable Are Ligand-Centric Methods for Target Fishing?

We recently published a paper on computational methods for molecular target prediction. A PDF of this article can be downloaded at

Computational methods for Target Fishing (TF), also known as Target Prediction or Polypharmacology Prediction, can be used to discover new targets for small-molecule drugs. This may result in repositioning the drug in a new indication or improving our current understanding of its efficacy and side effects. While there is a substantial body of research on TF methods, there is still a need to improve their validation, which is often limited to a small part of the available targets and not easily interpretable by the user. Here we discuss how target-centric TF methods are inherently limited by the number of targets that they can possibly predict (this number is by construction much larger in ligand-centric techniques). We also propose a new benchmark to validate TF methods, which is particularly suited to analyse how predictive performance varies with the query molecule. On average over approved drugs, we estimate that only five predicted targets will have to be tested to find two true targets with submicromolar potency (a strong variability in performance is however observed). In addition, we find that an approved drug currently has an average of eight known targets, which reinforces the notion that polypharmacology is a common and strong event. Furthermore, with the assistance of a control group of randomly-selected molecules, we show that the targets of approved drugs are generally harder to predict. The benchmark and a simple target prediction method to use as a performance baseline are available at
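As a rough illustration of what "ligand-centric" means, here is a generic nearest-neighbour sketch in Python. It is not the baseline method released with the paper. Fingerprints are represented as plain sets of integer feature IDs, and the reference ligands, target names and the `top_k` parameter are all hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of feature IDs."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def predict_targets(query_fp, reference, top_k=3):
    """Rank reference ligands by similarity to the query and return the known
    targets of the top_k most similar ones, most similar ligand first.

    reference: list of (fingerprint set, set of known target names).
    """
    ranked = sorted(reference, key=lambda entry: tanimoto(query_fp, entry[0]), reverse=True)
    predictions = []
    for _fingerprint, targets in ranked[:top_k]:
        for target in sorted(targets):
            if target not in predictions:
                predictions.append(target)
    return predictions


# Hypothetical reference set of ligands annotated with their known targets.
reference = [
    ({1, 2, 3, 4}, {"TARGET_A"}),
    ({1, 2, 3, 9}, {"TARGET_A", "TARGET_B"}),
    ({7, 8, 9}, {"TARGET_C"}),
]
query = {1, 2, 3, 4, 5}

print(predict_targets(query, reference, top_k=2))  # ['TARGET_A', 'TARGET_B']
```

Any target annotated on at least one reference ligand can in principle be predicted this way, with no per-target model required. This is why the number of predictable targets is by construction larger for ligand-centric techniques.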

The way back machine

I was just looking at one of my favourite websites, the web archive, which is helpful to dive into the past. Or at least that part of the past that can be captured digitally (web, audio, video,…).

In the web section, the way back machine, one can check what a particular website looked like many years ago. Check for instance the modern look of the EBI website today. Now have a look at this website on 6 June 1997. What a change, right? Most links are active, so you can inspect services available at that time, who was around, etc. If you wish to explore the website at some other moment in time, there is a navigation bar at the top which takes you to the other available captures.

Warning: this is a serious time sink!

Searching for two postdocs for my research lab in Marseille

I have just become a group leader at the CRCM in Marseille. Thus, I am currently searching for postdocs working in areas related to bioinformatics and drug discovery informatics.

The first post is to work on modelling cancer pharmacogenomics:

The second post will investigate new methods for drug polypharmacology prediction:

Both positions will be for two years in the first instance. The deadline for applications is Friday 17 October 2014.

Annual Symposium of MRC Fellows at BMA House

Every year, the MRC organises a one-day symposium for its fellows, which is also attended by MRC panel members and staff. In addition to networking opportunities, a number of very informative sessions are organised such as those on “Grant Writing”, “Establishing Successful Partnerships and Collaborations”, “Board and Panels – How do they Work?” or “Mentoring”. This year the symposium took place two days ago at the BMA House in London.

A recurrent topic in these meetings has been the progress on the Crick Institute. Jim Smith, research director for the Crick Institute, explained how they plan to appoint a number of early career scientists and provide them with group leader funding for 12 years. The latter is the limit of tenure of these positions, which is three years longer than similar schemes (EMBL) to facilitate a balance between career and family commitments.

As has been the case in previous years, Professor Sir John Savill, chief executive of the MRC, closed the symposium. He highlighted computational biology as one of the areas that need to be more strongly supported in the future. When he made a similar comment on bioinformatics last year, I asked him how the MRC was planning to intensify its already existing support. His reply highlighted the work that is done at EBI and how further funding of research at this type of institution was one route to strengthen this area.

The key role of machine learning in molecular docking

Predicting the binding affinities of large sets of diverse molecules against a range of macromolecular targets is an extremely challenging task. The scoring functions that attempt such computational prediction are essential for exploiting and analysing the outputs of docking, which is in turn an important tool in problems such as structure-based drug design. Classical scoring functions assume a predetermined theory-inspired functional form for the relationship between the variables that describe an experimentally-determined or modelled structure of a protein-ligand complex and its binding affinity. The inherent problem of this approach is in the difficulty of explicitly modelling the various contributions of intermolecular interactions to binding affinity.

New scoring functions based on machine-learning regression models, which are able to exploit effectively much larger amounts of experimental data and circumvent the need for a predetermined functional form, have already been shown to outperform a broad range of state-of-the-art scoring functions in a widely-used benchmark. Here we investigate the impact of the chemical description of the complex on the predictive power of the resulting scoring function using a systematic battery of numerical experiments. The latter resulted in the most accurate scoring function to date on the benchmark. Strikingly, we also found that a more precise chemical description of the protein-ligand complex does not generally lead to more accurate prediction of binding affinity. We discuss four factors that may contribute to this result: modelling assumptions; co-dependence of representation and regression; data restricted to the bound state; and conformational heterogeneity in data.

This paper is freely available at:

USR for the prospective identification of phenotypic hits

When expressed in its wild-type form (~50% of human tumors), the function of the tumour-suppressor protein p53 can be inhibited by the Murine Double Minute 2 (MDM2) protein. Therefore, inhibition of the p53-MDM2 interaction, leading to the activation of p53, represents an attractive strategy against several types of cancers.

In this study, we have used Ultrafast Shape Recognition (USR) to screen the set of FDA-approved drugs for novel p53-MDM2 inhibitors using a potent binder of the p53-pocket on MDM2 as template. Subsequent molecular modelling supported the potential role of the resulting USR hits as p53-MDM2 inhibitors. This was further supported by experimental tests showing that the treatment of human colon tumor cells with the top USR hit, telmisartan, led to a dose-dependent cell growth inhibition in a p53-dependent manner.

Telmisartan has a long history of safe human use as an approved anti-hypertension drug and thus may present an immediate clinical potential as a cancer therapeutic. Furthermore, it could also serve as a structurally-novel lead molecule for the development of more potent, small-molecule p53-MDM2 inhibitors against a variety of cancers.

From a methodological perspective, this study demonstrates that the adopted USR-based virtual screening protocol is useful for identifying molecules with whole-cell anti-cancer activity as well as potential small-molecule protein-protein interaction inhibitors. This is the fourth publication reporting a successful prospective application of USR (the others are here, here and here), with some more applications currently being prepared for publication.

The paper is available here.