Predicting Solubility

Drug Discovery with Graph Neural Networks — part 1

Learn How to Predict Molecular Solubility with GNNs Using Deepchem — a Deep Learning Library for Life Sciences.

Related Material

Introduction
A Special Chemistry Between Drug Development and Machine Learning
Why Molecular Solubility is Important
Approaching the Problem with Graph Neural Networks
Hands-on Part with Deepchem
About Me
References

Introduction

This article is a mix of theory behind drug discovery, graph neural networks and a practical part of Deepchem library. The first part will discuss potential applications of machine learning in drug development and then explain what molecular features might prove useful for the graph neural network model. We then dive into coding part and create a GNN model that can predict the solubility of a molecule. Let’s get started!

A Special Chemistry between Drug Development and Machine Learning

Drug development is a time-consuming process which might take decades to approve the final version of the drug [1]. It starts from the initial stage of drug discovery where it identifies certain groups of molecules that are likely to become a drug. Then, it goes through several steps to eliminate unsuitable molecules and finally tests them in real life. Important features that we look at during the drug discovery stage are ADME (Absorption, Distribution, Metabolism, and Excretion) properties. We can say that drug discovery is an optimization problem where we predict the ADME properties and choose those molecules that might increase the likelihood of developing a safe drug [2]. Highly efficient computational methods that find molecules with desirable properties speed up the drug development process and give a competitive advantage over other R&D companies.

It was only a matter of time before machine learning was applied to the drug discovery. This allowed to process molecular datasets with a speed and precision that had not been seen before [3]. However, to make the molecular structures applicable to machine learning, many complicated preprocessing steps have to be performed such as converting 3D molecular structures to 1D fingerprint vectors, or extracting numerical features from specific atoms in a molecule.

Why Molecular Solubility is Important

One of the ADME properties, absorption, determines whether the drug can reach efficiently the patient’s bloodstream. One of the factors behind the absorption is aqueous solubility, i.e. whether a certain substance is soluble in water. If we are able to predict the solubility, we can also get a good indication of the absorption property of the drug.

Approaching the Problem with Graph Neural Networks

To apply GNNs to molecular structures, we must transform the molecule into a numerical representation that can be understood by the model. It is a rather complicated step and it will vary depending on the specific architecture of the GNN model. Fortunately, most of that preprocessing is covered by external libraries such as Deepchem or RDKit.

Here, I will quickly explain the most common approaches to preprocess a molecular structure.

SMILES

SMILES is a string representation of the 2D structure of the molecule. It maps any molecule to a special string that is (usually) unique and can be mapped back to the 2D structure. Sometimes, different molecules can be mapped to the same SMILES string which might decrease the performance of the model.

Fingerprints

Fingerprints is a binary vector where each bit represents whether a certain substructure of the molecule is present or not. It is usually quite long and might fail to incorporate some structural information such as chirality.

Adjacency Matrix and Feature Vectors

Another way to preprocess a molecular structure is to create an adjacency matrix. The adjacency matrix contains information about the connectivity of atoms, where “1” means that there is a connection between them and “0” that there is none. The adjacency matrix is sparse and is often quite big which might not be very efficient to work with.

CH4 (methane) converted to adjacency matrix (top right), vector of feature vectors (middle right), and matrix of feature pair vectors (bottom right). The adjacency matrix gives us information about the connectivity between atoms, e.g. carbon C is connected to itself and all other H atoms (first row of the adjacency matrix). Individual feature vector, let’s say v0, contains information about specific atom. Individual Feature Pair Vector contains information about two neighbouring atoms and it is often a function (sum, average, etc. ) of two feature vectors of these individual atoms.

Together with this matrix, we can provide to the GNN model information about each individual atom and information about neighbouring atoms in a form of a vector. In the feature vector for each atom, there can be information about the atomic number, number of valence electrons, or number of single bonds. There is of course many more and they can fortunately be generated by RDKit and Deepchem,

A Feature Vector usually contains information about specific atom. This vector is often generated by using the functionality from the RDKit or Deepchem package.

Solubility

The variable that we are going to predict is called cLogP and is also known as octanol-water partition coefficient. Basically, the lower is the value the more soluble it is in water. clogP is a log ratio so the values range from -3 to 7 [6].

There is also a more general equation describing the solubility logS:

Solubility Equation. MP is a melting point (Celcius Degrees). **logKow** is an octanol-water partition coefficient, aka. **cLogP**

The problem with that equation is that MP is very difficult to predict from the chemical structure of the molecule [7]. All available solubility datasets contain only cLogP value and this is the value that we are going to predict as well.

Hands-on Part with Deepchem

Colab notebook that you can run by yourself is here.

Deepchem is a deep learning library for life sciences that is built upon few packages such as Tensorflow, Numpy, or RDKit. For molecular data, it provides convenient functionality such as data loaders, data splitters, featurizers, metrics, or GNN models. From my experience, it is quite troublesome to setup so I would recommend running it on the Colab notebook that I’ve provided. Let’s get started!

Firstly, we will download a Delaney dataset, which is considered as a benchmark for solubility prediction task. We then load the dataset using CSVLoader class and specify a column with cLogP data which is passed into tasks argument. In smiles_field, name of the column with SMILES string have to be specified. We choose a ConvMolFeaturizer which will create input features in a format required by the GNN model that we are going to use.

Later, we split the dataset using RandomSplitter and divide data into training and validation set. We also use a normalization for y values so they have zero mean and unit standard deviation.

In this example, we will use a GraphConvModel as our GNN models. It’s an architecture that was created by Duvenaud, et al. You can find their paper here. There are other GNN models as a part of the Deepchem package such as WeaveModel, or DAGModel. You can find a full list of the models with required featurizers here.

In this code snippet, a person R2 score is also defined. Simply speaking, the closer this value is to 1, the better is the model.

Deepchem models use Keras API. The graph_conv model is trained with the fit() function. Here you can also specify the number of epochs. We get the scores with evaluate() function. Normalizer has to be passed here because y values need to be mapped again to the previous range before computing the metric score.

And that’s all! You can do much more interesting stuff with Deepchem. They created some tutorials to show what else you can do with it. I highly suggest looking over it. You can find them here.

Thank you for reading the article, I hope it was useful for you!

About Me

I am an MSc Artificial Intelligence student at the University of Amsterdam. In my spare time, you can find me fiddling with data or debugging my deep learning model (I swear it worked!). I also like hiking :)

Here are my social media profiles, if you want to stay in touch with my latest articles and other useful content:

References

[1] Early phase drug discovery: Cheminformatics and computational techniques in identifying lead series: https://www.sciencedirect.com/science/article/abs/pii/S0968089612003598

[2] ADME Properties: https://www.cambridgemedchemconsulting.com/resources/ADME/

[3] Machine learning in chemoinformatics and drug discovery: https://www.sciencedirect.com/science/article/pii/S1359644617304695

[4] Molecular topology in qsar and drug design studies: https://www.researchgate.net/publication/236018587_Molecular_topology_in_QSAR_and_drug_design_studies

[5] Revisiting Molecular Hashed Fingerprints: https://chembioinfo.wordpress.com/2011/10/30/revisiting-molecular-hashed-fingerprints/

[6] Lecture note of Dr. Hiroshi Yamamoto: http://www.pirika.com/ENG/TCPE/logP-Theory.html#:~:text=Definition%3A,%2Dphase%20octanol%2Fwater%20system.&text=Values%20of%20Kow%20are%20thus%2C%20unitless.&text=Values%20of%20Kow%20are%20usually,(20%20or%2025'C.

[7] Predicting Aqueous Solubility — It’s Harder Than It Looks: http://practicalcheminformatics.blogspot.com/2018/09/predicting-aqueous-solubility-its.html