AI scans RNA ‘dark matter’ and uncovers 70,000 new viruses
Researchers have used artificial intelligence (AI) to uncover 70,500 viruses previously unknown to science [1], many of them weird and nothing like known species. The RNA viruses were identified using metagenomics, in which scientists sample all the genomes present in the environment without having to culture individual viruses. The method shows the potential of AI to explore the ‘dark matter’ of the RNA virus universe.
Viruses are ubiquitous microorganisms that infect animals, plants and even bacteria, yet only a small fraction have been identified and described. There is “essentially a bottomless pit” of viruses to discover, says Artem Babaian, a computational virologist at the University of Toronto in Canada. Some of these viruses could cause diseases in people, which means that characterizing them could help to explain mystery illnesses, he says.
Previous studies have used machine learning to find new viruses in sequencing data. The latest study, published in Cell this week, takes that work a step further by applying machine learning to predicted protein structures as well [1].
The AI model incorporates a protein-prediction tool, called ESMFold, that was developed by researchers at Meta (formerly Facebook, headquartered in Menlo Park, California). A similar AI system, AlphaFold, was developed by researchers at Google DeepMind in London, who won the Nobel Prize in Chemistry this week.
Missed viruses
In 2022, Babaian and his colleagues searched 5.7 million genomic samples archived in publicly available databases and identified almost 132,000 new RNA viruses [2]. Other groups have led similar efforts [3].
But RNA viruses evolve quickly, so existing methods for identifying RNA viruses in genomic sequence data probably miss many. A common method is to look for a section of the genome that encodes a key protein used in RNA replication, called RNA-dependent RNA polymerase (RdRp). But if the sequence that encodes this protein in a virus is vastly different from any known sequence, researchers won’t recognize it.
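The blind spot described above can be illustrated with a toy similarity search. The sketch below is not the method used in any of the studies mentioned here: real pipelines typically rely on sequence alignment or profile-based searches, whereas this example substitutes a simple shared k-mer score (Jaccard similarity) as a stand-in. The protein strings, the `threshold` value and the function names are all invented for illustration and do not correspond to real RdRp sequences.

```python
def kmers(seq, k=3):
    """Return the set of overlapping length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=3):
    """Fraction of k-mers shared by two sequences (0.0 = none, 1.0 = identical sets)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def looks_like_known(query, references, threshold=0.3, k=3):
    """Flag a query only if it resembles at least one known reference.

    The threshold is an arbitrary illustrative value, not a published cutoff.
    """
    return any(jaccard(query, ref, k) >= threshold for ref in references)


# Invented toy "proteins" (hypothetical, not real RdRp sequences).
KNOWN_REFERENCE = "MKTLDGSGDDLVVICESAGTQEDAAS"
CLOSE_VARIANT = "MKTLDGSGDDLIVICESAGTQEDAAS"   # one residue changed
DIVERGENT = "PWHFRNYCMQKPWHFRNYCMQK"           # shares no 3-mers with the reference

if __name__ == "__main__":
    for name, seq in [("close variant", CLOSE_VARIANT), ("divergent", DIVERGENT)]:
        found = looks_like_known(seq, [KNOWN_REFERENCE])
        print(f"{name}: recognized={found}")
```

In this toy setting the close variant is recognized and the divergent sequence is not, even if both were to encode a working polymerase. That is the gap a model trained on more than raw sequence similarity is meant to narrow.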
Shi Mang, an evolutionary biologist at Sun Yat-sen University in Shenzhen, China, and a co-author of the Cell study, and his colleagues went looking for previously unrecognized viruses in publicly available genomic samples.
They developed a model, called LucaProt, using the ‘transformer’ architecture that underpins ChatGPT, and fed it sequencing and ESMFold protein-prediction data. They then trained their model to recognize viral RdRps and used it to find sequences that encoded these enzymes — evidence that those sequences belonged to a virus — in the large tranche of genomic data. Using this method, they identified some 160,000 RNA viruses, including some that were exceptionally long and found in extreme environments such as hot springs, salt lakes and air. Just under half of them had not been described before. They found “little pockets of RNA virus biodiversity that are really far off in the boonies of evolutionary space”, says Babaian.
“It’s a really promising approach for expanding the virosphere,” says Jackie Mahar, an evolutionary virologist at the CSIRO Australian Centre for Disease Preparedness in Geelong. Characterizing viruses will help researchers to understand the microbes’ origins and how they evolved in different hosts, she says.
And expanding the pool of known viruses makes it easier to find more viruses that are similar, says Babaian. “All of a sudden you can see things that you just weren’t seeing before.”
The team wasn’t able to determine the hosts of the viruses they identified, a question that should be investigated further, says Mahar. Researchers are particularly interested in knowing whether any of the new viruses infect archaea, an entire branch of the tree of life that no RNA viruses have been clearly shown to infect.
Shi is now developing a model to predict the hosts of these newly identified RNA viruses. He hopes this will help researchers to understand the roles that viruses have in their environmental niches.