Bayesian modelling has been a popular and effective method of statistical analysis for centuries, and variations of it are particularly useful for structure-property predictions. As a rule of thumb, when the source data is a series of yes/no facts, and the observation is also of the yes/no variety, there is probably a way to adapt the original Bayes theorem to produce a probabilistic estimate of whether a previously unobserved data point is likely to be in the yes category. As an additional advantage, most implementations of naive Bayesian methods are also very fast and scale well, which means they can be routinely applied to large datasets with rapid gratification.
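For a single yes/no feature F and a yes/no outcome Y, the theorem can be written as below. This is the textbook form, included only as a reminder; the symbols are chosen here for illustration and do not come from the apps.

```latex
P(Y \mid F) = \frac{P(F \mid Y)\, P(Y)}{P(F)}
```

The naive variant treats the features as independent given the outcome, so the per-feature terms multiply, or add once logarithms are taken.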
For the kinds of structure-property predictions that are used in early stage drug discovery and other kinds of chemical predictions,
Bayesian models are typically built from fingerprints, which are derived from the molecular structures. Because most kinds of
chemical fingerprint schemes generate a large number of inputs (typically from thousands to billions of possible fingerprint "bits"),
the most effective Bayesian variant is the Laplacian-corrected form, in which the contribution of each fingerprint is adjusted for how
often it has been observed, and the contributions are summed on a log scale. This avoids scale mismatches and floating point precision
issues, but introduces the need for an additional calibration step.
These procedures are to some extent discussed in:
Sean Ekins; Alex M. Clark; Malabika Sarker: "TB Mobile: a mobile app for anti-tuberculosis molecules with known targets",
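A minimal sketch of a Laplacian-corrected naive Bayesian model over fingerprint bits follows. It uses the commonly published form of the correction, in which each bit contributes ln((A_i + 1) / (T_i * P + 1)), where T_i molecules contain the bit, A_i of those are active, and P is the overall fraction of actives. Whether the apps use exactly this expression is an assumption, the function names are illustrative, and the calibration step is left out.

```python
import math
from collections import Counter


def train_laplacian_bayes(fingerprints, labels):
    """Return {bit: log contribution} for a Laplacian-corrected naive Bayes model.

    fingerprints: list of sets of integer fingerprint bits (one set per molecule)
    labels: list of booleans (True = "yes"/active)
    """
    total_count = Counter()   # number of molecules containing each bit
    active_count = Counter()  # number of active molecules containing each bit
    for bits, is_active in zip(fingerprints, labels):
        for bit in bits:
            total_count[bit] += 1
            if is_active:
                active_count[bit] += 1

    base_rate = sum(labels) / len(labels)  # overall fraction of actives
    # The +1 terms are the Laplacian correction: a bit seen only a few times
    # is pulled towards a contribution of zero, i.e. towards the base rate.
    return {
        bit: math.log((active_count[bit] + 1.0) / (total_count[bit] * base_rate + 1.0))
        for bit in total_count
    }


def raw_score(model, bits):
    """Sum the log contributions of the bits present; unseen bits add nothing."""
    return sum(model.get(bit, 0.0) for bit in bits)


# Toy training set: bit 1 occurs only in actives, bit 2 only in inactives,
# and bits 3 and 4 occur equally often in both classes.
training_bits = [{1, 3}, {1, 4}, {2, 3}, {2, 4}]
training_labels = [True, True, False, False]
model = train_laplacian_bayes(training_bits, training_labels)

print(raw_score(model, {1}))  # positive: resembles the actives
print(raw_score(model, {2}))  # negative: resembles the inactives
```

The raw score is an unbounded sum of logarithms, not a probability, which is why a separate calibration step is needed before the value can be read as a likelihood.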
Since the TB Mobile app was upgraded to use the ECFP_6 fingerprints by implementing the same algorithm as the open source method, a number of other apps have also incorporated this technology, including the Mobile Molecular DataSheet, Approved Drugs and MolPrime+. These apps can use Bayesian models and ECFP_6 fingerprints to make predictions of a variety of different chemical properties or biological activities. The results can be utilised as calibrated probabilities, or as visualised molecular structures with colour-coded fragments.
At the present time, use of Bayesian models is provided in 3 iOS apps from Molecular Materials Informatics, in a read-only capacity, as they are bundled with the apps. As of February 2015, work is in progress to add the ability to create models using apps, as well as import and export them, which will allow transferring models between apps, and any compatible software platforms.
There are 8 Bayesian models that are available in prepackaged form, e.g. in the Mobile Molecular DataSheet app:
Estimates the likelihood that a compound has adequate solubility in water, from the point of view of being a reasonable drug candidate.
The source data is based on that used by:
Jarmo Huuskonen: "Estimation of water solubility from atom-type electrotopological state indices",
Estimates the likelihood that a probe-like molecule meets the criteria of Chris Lipinski. The model is based on selecting known probes and manually classifying each of them based on expert knowledge. The details and source materials are described in:
Christopher A. Lipinski; Nadia K. Litterman; Christopher Southan; Antony J. Williams; Alex M. Clark; Sean Ekins: "The parallel worlds of public and commercial bioactive chemistry data",
hERG & KCNQ1 Avoidance
Two models are based on experimental measurements for binding against the hERG and KCNQ1 targets, respectively. The activity threshold has been inverted, so that the models estimate the probability that the compound does not bind to these targets, since activity against either of these is generally unwanted for drug discovery purposes. Higher predictions suggest that the compound is more likely to be a viable drug candidate with fewer side effects.
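The inverted threshold can be illustrated with a short sketch. The potency scale, the cutoff value and the function name are all hypothetical, chosen only to show how "does not bind" becomes the yes category when the training labels are prepared.

```python
def avoidance_labels(measured_activity, threshold):
    """Label a compound True ("yes") when it does NOT reach the binding threshold.

    measured_activity: potency values where a larger number means stronger binding
    threshold: hypothetical cutoff at or above which a compound counts as a binder
    """
    return [value < threshold for value in measured_activity]


# Hypothetical potency values (e.g. on a pIC50-like scale) for four compounds.
potencies = [4.2, 6.8, 5.1, 7.5]
print(avoidance_labels(potencies, threshold=6.0))  # [True, False, True, False]
```

A model trained on labels like these scores a compound highly when it resembles the non-binders, which is the behaviour described for the two avoidance models.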
Plague, Chagas, Malaria & Tuberculosis
Four sample datasets have been used to model activity against rare or neglected diseases. These should be considered as proof of concept demonstrations for using Bayesian models for mobile drug discovery purposes. The bubonic plague and Chagas models are based on the outcome of a high-throughput screen. The malaria model is based on binding at <10 nM against any of several different targets. The tuberculosis model is based on a phenotypic screen at 1 μM or better.
Each of the apps that supports Bayesian models has a slightly different way of exposing the prediction functionality, due to the differing use cases for the respective apps.
The MolPrime+ app is a powerful tool designed for operating on one molecule at a time. Invoking the Bayesian model prediction is very simple: either touch-and-hold any of the molecule glyphs on the main screen, or tap on a molecule that is already selected. This will bring up the menu, which is displayed as a set of icons on the left hand side. Properties are displayed to the right, which can be viewed by swiping to the left:
The information shown within the side-scrolling icon menu is for preview purposes. To open a slightly more detailed property screen, activate the properties icon (shown above), which brings up the dialog box:
The results from the Bayesian models are shown alongside other calculated properties. For each of the models, the estimated probability is shown, alongside an atom-coloured overlay. The visual representation shows an indication of which fragments contributed positively or negatively to the score: red indicates areas of the molecule whose structural fingerprints tended to show up often in negative results, green indicates those found in positive results, and intermediate values are rendered in yellow.
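One way to picture the colour scale is as an interpolation from red through yellow to green. The sketch below is not the apps' actual encoding: it assumes each atom has already been assigned a single signed contribution value, and the function name and the `limit` parameter are hypothetical.

```python
def contribution_colour(value, limit=1.0):
    """Map a contribution in [-limit, +limit] to an (r, g, b) tuple of 0..255 ints.

    -limit gives red, 0 gives yellow, +limit gives green; values outside the
    range are clamped to the nearest end.
    """
    scaled = max(-1.0, min(1.0, value / limit))
    if scaled < 0:
        # Blend from red (255, 0, 0) towards yellow (255, 255, 0).
        return (255, round(255 * (1 + scaled)), 0)
    # Blend from yellow (255, 255, 0) towards green (0, 255, 0).
    return (round(255 * (1 - scaled)), 255, 0)


for value in (-1.0, 0.0, 1.0):
    print(value, contribution_colour(value))
```

Clamping keeps an unusually strong contribution from pushing the colour outside the red-to-green range.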
There are two methods in the underlying algorithm that have yet to be described in detail in the literature: (1) the calibration method for transforming the log-scale Bayesian predictions into probability-like estimates, and (2) the manner in which the atom-specific contributions are visually encoded. The calibration is currently being written as a manuscript draft, while the visual display is still experimental, and will likely be refined before being formally documented.
Bayesian models are available in the Approved Drugs app within the Custom tab, which allows user-defined molecules to be imported or drawn. The models are listed mnemonically along the top: Aq for aqueous solubility, hE for hERG avoidance, Kq for KCNQ1 avoidance and Lp for Lipinski probelikeness. Tapping any of these buttons allows the model to be added and shown for each of the compounds listed:
The resulting band shown to the right of the structure indicates the prediction from the model, where red is for low, green for high, and yellow for intermediate. It is possible to switch on any combination of models, e.g.:
The bands are colour-coded in the same order as onscreen, i.e. top-down corresponds to left-right.
More detail can be obtained by tapping on any of the structures. The Data tab shows calculated probability estimates for each of the models:
The Models tab shows a visual overlay for each of the Bayesian predictions, and can be scrolled by swiping left-to-right:
The Mobile Molecular DataSheet app invokes Bayesian predictions in two modes: for individual molecules, and tables of molecules. The single molecule visualisation is very similar to MolPrime+:
For operating on collections of molecules, the interface is merged with the property calculation feature. Any of the available Bayesian models can be included in the calculation:
When used to calculate predictions for a whole datasheet, the numeric probabilities are stored in a column created for the occasion. The predictions persist when the datasheet is viewed or exported, in contrast to the single molecule property visualisation, which is ephemeral.
Predefined Bayesian models can be used to very easily view predicted properties or visualise atom contributions to predictions. At the present time apps are only able to use predictions from predefined models, but in the future they will be able to create and share models.