Machine learning may be used as an alternative to traditional alignment-based approaches to identify new antimicrobial resistance (AMR) genes, especially when microbes cannot be grown in labs. In a previous study, the researchers created a machine learning algorithm that identified AMR genes in Gram-negative bacteria from protein characteristics, called “features.” In this study, the researchers show that this algorithm also works for Gram-positive bacteria, identifying genes that confer resistance to bacitracin and vancomycin with 87-90% accuracy. The researchers also present a software tool built on this algorithm.
The researchers wanted to see if a machine learning algorithm they created to identify AMR genes in Gram-negative bacteria would also work in Gram-positive bacteria. They also wanted to develop a software tool using this algorithm.
Antimicrobial resistance (AMR) occurs when bacteria become less susceptible to antimicrobial substances. AMR in bacteria has many causes, such as overexpression or duplication of existing genes, mutations, and acquisition of resistance genes from neighboring bacteria. In the U.S., about 2.8 million people are infected by resistant bacteria each year, resulting in more than 35,000 deaths. Resistant bacteria are a threat to human health worldwide, making it important to develop efficient tools to predict AMR.
AMR is traditionally predicted by aligning sequence data with reference databases. Although this method is reliable for highly conserved AMR genes, it can produce many false positives when the sequence data differ substantially from the sequences in the reference database. Machine-learning techniques may be used as an alternative, predicting new AMR genes from metagenomic and pan-genomic data. However, earlier machine-learning approaches used only a small number of genetic features and lacked a feature-selection step to remove irrelevant and redundant features, which limited their accuracy.
The researchers recently created a game-theory-based feature selection approach and applied it to Gram-negative bacteria, where it accurately predicted (93-99% prediction accuracy) AMR genes encoding acetyltransferase (aac), β-lactamase (bla), and dihydrofolate reductase (dfr). In this study, the researchers test the same approach with sequence data from Gram-positive bacteria. They combined the results of both studies to create “Prediction of Antimicrobial Resistance via Game Theory” (PARGT), a software package designed to identify AMR genes in both Gram-positive and Gram-negative bacteria. The software can be used to detect bac and van resistance genes in Gram-positive bacteria and aac, bla, and dfr genes in Gram-negative bacteria.
Validation of PARGT
In their previous study, the researchers trained the machine learning model on aac, bla, and dfr sequences from Acinetobacter, Klebsiella, Campylobacter, Salmonella, and Escherichia, and tested its accuracy on sequences from Pseudomonas, Vibrio, and Enterobacter. The model correctly classified aac, bla, and dfr with 93%, 99%, and 97% accuracy, respectively.
In this study, the researchers tested this algorithm on AMR genes in Gram-positive bacteria. To train the model, they used 25 van and 52 bac AMR sequences from Clostridium and Enterococcus. They selected features of AMR genes based on their relevance, non-redundancy, and interdependence with other features. They tested the trained algorithm using 6 bac AMR sequences, 9 van AMR sequences, and 14 non-AMR sequences from Staphylococcus, Streptococcus, and Listeria.
The algorithm successfully identified all 6 bac genes (true positives) but misclassified two non-AMR genes as AMR (false positives). The sensitivity, specificity, and accuracy for the bac sequences were 100%, 86%, and 90%, respectively.
The algorithm successfully identified 8 of the 9 van genes (true positives) but misclassified two non-AMR genes as AMR (false positives). The sensitivity, specificity, and accuracy for the van sequences were 89%, 86%, and 87%, respectively.
The NCBI accession numbers and protein names of the AMR sequences, and whether each was a true or false positive, can be seen in Tables 1 and 2 for bac and van, respectively.
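The reported percentages follow directly from the counts given above. The short Python sketch below recomputes them; the function name is illustrative, and the true-negative count of 12 is inferred from the 14 non-AMR test sequences minus the 2 false positives.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)              # share of AMR genes that were found
    specificity = tn / (tn + fp)              # share of non-AMR genes correctly rejected
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy

# bac: 6 of 6 AMR genes found; 2 of 14 non-AMR genes flagged as AMR
bac = classification_metrics(tp=6, fn=0, fp=2, tn=12)
# van: 8 of 9 AMR genes found; 2 of 14 non-AMR genes flagged as AMR
van = classification_metrics(tp=8, fn=1, fp=2, tn=12)

for name, (sens, spec, acc) in (("bac", bac), ("van", van)):
    print(f"{name}: sensitivity {sens:.0%}, specificity {spec:.0%}, accuracy {acc:.0%}")
# bac: sensitivity 100%, specificity 86%, accuracy 90%
# van: sensitivity 89%, specificity 86%, accuracy 87%
```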
Performance comparison with BLASTp and Kalign tools
The researchers compared their algorithm’s performance against traditional alignment tools, specifically BLASTp and Kalign. Lowering the sequence identity required for a match allowed these tools to identify more of the AMR sequences, but it also led to a high number of false positives, in which non-AMR sequences were miscategorized as AMR. In addition, although BLASTp and Kalign performed better when the AMR sequences and bacterial sequences were highly similar, the researchers’ algorithm performed better at low similarity, since it identifies AMR sequences from features of AMR genes rather than direct sequence similarity.
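To make the idea of a similarity threshold concrete, the sketch below computes percent identity for two sequences that have already been aligned and applies a cutoff. It is only an illustration: the sequences and the 80% and 60% cutoffs are made up, and real tools such as BLASTp compute alignments and identity with their own conventions, which may differ from the gap handling assumed here.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over aligned (non-gap) columns.

    Assumes both inputs are rows of the same alignment, with '-' marking gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no aligned columns to compare")
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

# Toy aligned fragments (hypothetical, not real proteins)
query = "MKT-LLVA"
reference = "MKSALLIA"
identity = percent_identity(query, reference)

# A strict cutoff rejects this hit; a looser cutoff accepts it
calls = {cutoff: identity >= cutoff for cutoff in (80, 60)}
print(round(identity, 1), calls)
```

A looser cutoff accepts more distant hits, which is how an alignment-based search can recover more true AMR genes while also admitting more non-AMR sequences.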
In this study, the researchers developed the “Prediction of Antimicrobial Resistance via Game Theory” (PARGT) software package to identify AMR genes in both Gram-negative and Gram-positive bacteria. PARGT generates protein features automatically and makes predictions for the sequences the user supplies. In addition, users can update PARGT by inputting their own known AMR and non-AMR sequences to retrain the algorithm, which may improve its accuracy. In their previous study, the researchers showed that PARGT could accurately predict AMR genes in Gram-negative bacteria. In this study, they showed that PARGT works in Gram-positive bacteria as well, with a prediction accuracy of 87-90%. PARGT performed better for bac, for which more training sequences were available, whereas traditional alignment tools (BLASTp and Kalign) worked well for van because the van sequences shared more sequence similarity.
GTDWFE algorithm for feature collection
Candidate features were collected using a literature search. The GTDWFE algorithm selected the best features based on relevance, non-redundancy, and interdependency with other features.
After identifying the best features, the researchers used them to train machine-learning models. They then selected the best model for classifying the bac and van datasets.
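The sketch below illustrates the general idea of choosing features that are relevant but not redundant. It is not the GTDWFE algorithm: it is a much simpler greedy stand-in that scores each feature by its correlation with the label minus its average correlation with features already chosen, and it ignores interdependency entirely. The feature names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def select_features(columns, labels, k):
    """Greedily pick k features: high relevance to labels, low redundancy."""
    relevance = {name: abs(pearson(vals, labels)) for name, vals in columns.items()}
    selected, remaining = [], set(columns)

    def score(name):
        if not selected:
            return relevance[name]
        # Average absolute correlation with features chosen so far
        redundancy = sum(abs(pearson(columns[name], columns[s])) for s in selected)
        return relevance[name] - redundancy / len(selected)

    while remaining and len(selected) < k:
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: 3 AMR (label 1) and 3 non-AMR (label 0) proteins, hypothetical features
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "hydrophobicity":     [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],
    "hydrophobicity_alt": [0.8, 0.8, 0.6, 0.2, 0.2, 0.4],  # near-duplicate of the first
    "net_charge":         [2, 3, 4, 2, 1, 0],
    "noise":              [1, 0, 1, 1, 0, 1],               # unrelated to the label
}
selected = select_features(columns, labels, k=2)
print(selected)
```

In this toy run the near-duplicate feature is passed over in favor of a less correlated one, even though the duplicate is individually more relevant, which is the behavior a non-redundancy criterion is meant to produce.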
Overview of PARGT software
PARGT is open-source software written in Python and R. PARGT uses the best features identified by the GTDWFE algorithm to make predictions. It also allows users to add AMR and non-AMR sequences to further train the machine-learning model, which could increase PARGT’s accuracy.
Architecture of PARGT
Users input a set of known AMR and non-AMR sequences as a training dataset, from which PARGT generates the best features for identifying AMR genes in bacterial sequences. The trained machine learning model is then used to predict AMR sequences. PARGT can predict aac, bla, and dfr resistance genes in Gram-negative bacteria and bac and van resistance genes in Gram-positive bacteria.
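The train-then-predict flow described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not PARGT’s implementation or interface: amino-acid composition stands in for PARGT’s protein features, a nearest-centroid rule stands in for its trained model, and the sequences are made-up placeholders.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Feature vector: fraction of each of the 20 standard amino acids."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def centroid(vectors):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def distance(u, v):
    """Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def train(amr_seqs, non_amr_seqs):
    """Build a model from labeled training sequences (one centroid per class)."""
    return {
        "AMR": centroid([composition(s) for s in amr_seqs]),
        "non-AMR": centroid([composition(s) for s in non_amr_seqs]),
    }

def predict(model, seq):
    """Assign the class whose centroid is closest to the query's features."""
    feats = composition(seq)
    return min(model, key=lambda label: distance(model[label], feats))

# Placeholder training data (hypothetical sequences, not real proteins)
model = train(
    amr_seqs=["MKKLLLAAKK", "MKLKLAKKLL"],
    non_amr_seqs=["MDEDSSTDEE", "MSEDTDSEED"],
)
print(predict(model, "MKKALLKKLA"))
print(predict(model, "MDSEEDTSDE"))
```

Adding more labeled sequences and calling `train` again mirrors, in miniature, how users can retrain the model with their own AMR and non-AMR data.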
AMR sequences were collected from the Antibiotic Resistance Genes Database, and non-AMR sequences were obtained from PATRIC.
NCBI accession numbers for proteins can be accessed at https://github.com/abu034004/PARGT.
The PARGT software package and user manual are available at https://github.com/abu034004/PARGT.