dcyphr | Predicting novel drugs for SARS-CoV-2 using machine learning from a >10 million chemical space


There is an urgent need to identify therapeutics against COVID-19. Here, the researchers developed a machine learning method to identify candidates. First, they collected data for 65 human proteins that interact with SARS-CoV-2 proteins. Then, they trained machine learning models to predict inhibitory activity and used the models to evaluate ~100,000 FDA-registered chemicals and approved drugs and ~14 million purchasable chemicals. The researchers screened these predictions based on toxicity and volatility (how readily a chemical evaporates). Because SARS-CoV-2 infects humans through the respiratory tract, the researchers propose chemicals that evaporate easily and could be inhaled. The researchers also identified candidates that act against SARS-CoV-2 in multiple ways, which are promising for future research. The researchers believe that this study could accelerate testing of repurposed drugs for short-term approval for use in humans, as well as newer drugs that may take longer to approve.


The researchers wanted to identify potential therapeutic candidates for SARS-CoV-2 using a machine learning model.


The rapidly evolving COVID-19 pandemic requires the accelerated development of therapeutic treatments. Several human proteins are targeted by the virus during infection, which means they could be targeted by therapeutics to prevent infection. A recent study identified 66 human proteins that were suitable candidates for identifying therapeutics. Many of these proteins are overexpressed in the respiratory tract. This means that inhaled therapeutics and preventive drugs may be effective for treating and preventing SARS-CoV-2 infection.


Because drug approval takes so long, repurposing drugs that are already approved for a different use may help get effective therapeutics to the public quickly. For example, Remdesivir has been effective in vitro and in non-human primates. Hydroxychloroquine is another repurposing candidate, but it has been less promising in clinical trials.


However, drugs designed to treat other diseases may not be effective in respiratory organs and the nervous system, which are primarily affected by SARS-CoV-2. There have been recent efforts to explore completely new therapeutics as well as preventive drugs, especially for drugs and small molecules that interfere with viral entry and replication. 

Further identification of potential therapeutics from approved drugs, FDA-registered chemicals, or widely purchasable chemicals is necessary. Here, the researchers attempted to identify therapeutics from these lists and also calculated properties such as toxicity, vapor pressure, and partition coefficient. They used data for 65 human proteins targeted by SARS-CoV-2 to train the machine learning models that identified candidate therapeutics. These data could be used to rapidly identify and test treatment strategies for COVID-19.


Identification of important structural features from known inhibitors of human target proteins

The researchers searched for common structures among inhibitory chemicals for the identified target proteins. Then, they used machine learning to predict chemicals that interfere with SARS-CoV-2 target proteins. Features of chemicals that best predicted their inhibitory activity included type and number of bonds, and 3D geometries. A list of these important features can be found in Table 1.


Machine learning models can successfully predict activity from chemical structures

The researchers found that the machine learning models performed well overall in computational validation. These results suggested that the models accurately predicted the inhibitory activity of chemical structures and could be used to screen lists of drugs for potential therapeutic candidates. The machine learning model is described in Figure 1, and its validation can be found in Figure 2.


Predicting candidates for repurposing of FDA-approved drugs

The researchers used the machine learning models to predict the activities of ~100,000 FDA-registered chemicals that could be repurposed for use against SARS-CoV-2 infection. Some of the approved drugs had high predicted activity against SARS-CoV-2 targets. Among the drugs scoring in the top 25, the researchers singled out those predicted to act on multiple proteins and found a few promising candidates, which are shown in Figure 3B.
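
The multi-target selection described above can be illustrated with a minimal sketch. It assumes that each target protein has its own list of predicted scores, that a higher score means stronger predicted inhibition, and that a drug counts as a hit for a target when it ranks in that target's top N. The function name, data layout, and example scores are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def multi_target_hits(predictions, top_n=25, min_targets=2):
    """Return drugs ranked in the top_n for at least min_targets targets.

    predictions maps a target name to a dict of drug name -> predicted score.
    Higher score = stronger predicted inhibition (an assumption of this sketch).
    """
    targets_hit = defaultdict(list)
    for target, scores in predictions.items():
        # Sort drugs by descending score; break ties by name for determinism.
        ranked = sorted(scores, key=lambda drug: (-scores[drug], drug))
        for drug in ranked[:top_n]:
            targets_hit[drug].append(target)
    hits = {d: sorted(t) for d, t in targets_hit.items() if len(t) >= min_targets}
    # Drugs that hit the most targets come first.
    return sorted(hits.items(), key=lambda item: (-len(item[1]), item[0]))


# Hypothetical scores, invented purely for illustration.
example = {
    "target_A": {"drug_1": 0.91, "drug_2": 0.40, "drug_3": 0.85},
    "target_B": {"drug_1": 0.88, "drug_2": 0.93, "drug_3": 0.10},
    "target_C": {"drug_1": 0.20, "drug_2": 0.75, "drug_3": 0.30},
}
shortlist = multi_target_hits(example, top_n=1, min_targets=2)
print(shortlist)
```

With `top_n=1`, only `drug_2` is the top-ranked drug for more than one target, so it is the only entry in `shortlist`.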


Predicting volatile drug candidates from a large ~14M chemical space

The researchers wanted to predict volatile, or easily evaporated, chemicals because these could be inhaled to reach target proteins that are overexpressed in the respiratory tract. The researchers used the machine learning models to search ~14 million commercially available chemicals for candidates. They isolated the top 1% of the most promising candidates, then developed additional machine learning models to predict volatility (measured as vapor pressure) and toxicity. Finally, they ranked the remaining candidates to find those with the highest volatility and lowest toxicity. These results can be found in Figure 4.
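
The filtering steps above can be sketched as follows. This is an illustration under stated assumptions, not the researchers' implementation: the field names are hypothetical, and ranking by vapor pressure first and toxicity second is one plausible reading of "highest volatility and lowest toxicity".

```python
def shortlist_volatile(candidates, top_fraction=0.01):
    """Keep the most active chemicals, then rank them by volatility and toxicity.

    candidates is a list of dicts with the hypothetical keys
    'id', 'activity', 'vapor_pressure', and 'toxicity'.
    """
    if not candidates:
        return []
    # Step 1: keep the top fraction by predicted activity (at least one chemical).
    n_keep = max(1, int(len(candidates) * top_fraction))
    by_activity = sorted(candidates, key=lambda c: c["activity"], reverse=True)
    top = by_activity[:n_keep]
    # Step 2: rank by highest vapor pressure, then by lowest toxicity.
    # This lexicographic ordering is an assumption of the sketch.
    return sorted(top, key=lambda c: (-c["vapor_pressure"], c["toxicity"]))


# Hypothetical predictions, invented purely for illustration.
chemicals = [
    {"id": "a", "activity": 0.9, "vapor_pressure": 10.0, "toxicity": 0.2},
    {"id": "b", "activity": 0.8, "vapor_pressure": 50.0, "toxicity": 0.1},
    {"id": "c", "activity": 0.3, "vapor_pressure": 99.0, "toxicity": 0.0},
    {"id": "d", "activity": 0.2, "vapor_pressure": 1.0, "toxicity": 0.9},
]
ranked_ids = [c["id"] for c in shortlist_volatile(chemicals, top_fraction=0.5)]
print(ranked_ids)
```

Chemical `c` is the most volatile in the example, but it never reaches the ranking step because its predicted activity is too low to pass the first filter.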

The researchers also looked for candidates that were not volatile but still had good predicted inhibition of SARS-CoV-2 targets. These results can be found in Figures 5A and 6A.


Although a vaccine is the best long-term intervention against SARS-CoV-2, therapeutic treatments will also be necessary to control disease severity in the short term. Currently, only Remdesivir has shown potential as a repurposed drug against SARS-CoV-2. The lack of a promising vaccine or therapeutic candidate necessitates the rapid identification and development of more therapeutic candidates.


In this study, the researchers created a machine learning pipeline to identify therapeutics for short- and long-term use, including candidates that could be delivered by inhalation. They screened more than 10 million purchasable chemicals and used machine learning to predict each chemical's toxicity and volatility, as well as whether it acts on multiple targets, which may improve efficacy.

However, it is important to note that machine learning models are only as good as the data available to train them, so their predictions are always limited. Regardless, machine learning-based predictions for purchasable compounds could accelerate drug discovery and drive research on these chemicals in the future, even for drugs unrelated to COVID-19.


Data sources for machine learning

The researchers retrieved chemicals from ZINC. Bioassay data was retrieved from ChEMBL 25. Toxicity data was taken from lists from various government agencies. Vapor pressure data was taken from EPI Suite by the EPA.


Selecting optimally predictive chemical features

The researchers computed chemical features using AlvaDesc descriptors and ranked them using cross-validated recursive feature elimination (CV-RFE), taking steps to mitigate selection bias. The machine learning approach they selected was an aggregate of support vector machines (SVMs), each trained with different parameters. Extended Connectivity Fingerprints (ECFP) were also used as a structural representation that is strongly associated with chemical activity, such as inhibition of SARS-CoV-2 target proteins. The machine learning models were assessed using the area under the ROC curve (AUC).
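
To make the last two steps concrete, here is a minimal sketch of combining several models and scoring the result with AUC. It assumes that the aggregate is a simple average of each model's score for a chemical, which the summary does not confirm, and it computes AUC as the fraction of active/inactive pairs in which the active chemical receives the higher score. The function names and example numbers are hypothetical.

```python
def aggregate_scores(per_model_scores):
    """Average the scores that several models assign to the same chemicals.

    per_model_scores is a list with one score list per model, all the same length.
    Simple averaging is an assumption of this sketch.
    """
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]


def roc_auc(labels, scores):
    """AUC: probability that a random active outscores a random inactive.

    labels use 1 for active and 0 for inactive; tied scores count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one active and one inactive example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model outputs, invented purely for illustration.
labels = [1, 1, 0, 0]
model_outputs = [
    [1.0, 0.5, 0.75, 0.0],   # model trained with one parameter setting
    [0.5, 0.25, 0.5, 0.25],  # model trained with another setting
]
combined = aggregate_scores(model_outputs)
auc = roc_auc(labels, combined)
print(combined, auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of active from inactive chemicals. The pairwise loop is quadratic in the number of examples, so a rank-based formula or a library routine would be a better fit for large validation sets.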