
New AI model learns DNA’s hidden language

The DNA (deoxyribonucleic acid) within the cells of our body contains the basic information needed to sustain life. Understanding how this information is stored and structured has been one of the greatest scientific challenges of the last century. With GROVER, a new large language model (LLM; a type of machine learning model that can comprehend and generate human language text) trained on human DNA, researchers can now attempt to decode the complex information hidden in our genome.

Since the discovery of the double-helix structure of DNA in 1953, scientists have sought to understand the information encoded within it. Today, it is clear that the information hidden in the DNA is multilayered. Only 1-2 percent of the genome consists of genes, the sequences that code for proteins.

DNA has many functions beyond coding for proteins. Some sequences regulate genes, others serve structural purposes, and most sequences serve multiple functions at once. Currently, we do not understand the meaning of most of the DNA. When it comes to the non-coding regions of the DNA, it seems that we have only started to scratch the surface.

Artificial Intelligence (AI) and LLMs such as ChatGPT have transformed our understanding of language. Trained exclusively on text, large language models developed the ability to use language in many contexts. Since DNA is the code of life, GROVER treats it like a language, and scientists could use it to extract biological meaning from the DNA.

GROVER learned the rules of DNA, which in language terms would be equivalent to learning the grammar, syntax, and semantics of a language. In the case of DNA this means learning the rules governing the sequences, the order of the nucleotides and sequences, and the meaning of the sequences. Like GPT models learning human languages, GROVER has basically learned how to ‘speak’ DNA.

GROVER was not only able to accurately predict DNA sequences but could also extract contextual information that has biological meaning, e.g., identify gene promoters or protein binding sites on DNA. GROVER also learns processes that are generally considered to be ‘epigenetic’ (regulatory processes taking place on top of the DNA rather than being encoded in it).

DNA resembles language. It has four letters that build sequences, and the sequences carry meaning. Unlike a language, however, DNA has no defined words. Instead, it has four nucleotide bases, Adenine (A), Cytosine (C), Guanine (G), and Thymine (T), that function as the fundamental units of the genetic code comprising a gene.

To train GROVER, the team first had to create a DNA dictionary. They used a method employed in data compression algorithms, analyzing the whole genome and looking for the combinations of letters that occur most often. They started with two letters and went over the entire DNA, again and again, to build up to the most common multi-letter combinations. In this way, in about 600 cycles, the scientists were able to fragment the DNA into ‘words’ that let GROVER perform best when predicting the next sequence.
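
The dictionary-building procedure described above resembles the general idea of byte-pair encoding, a compression technique that repeatedly merges the most frequent adjacent pair of symbols. The following Python sketch illustrates that idea on a toy sequence. It is an assumption-laden illustration, not the researchers' actual code: the function names, the toy sequence, and the cycle counts are all hypothetical, and a real genome would require far more efficient counting.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if no pair exists."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def build_vocabulary(sequence, cycles):
    """Start from single letters and merge the most frequent pair once per cycle.

    Hypothetical helper: returns the learned 'words' and the tokenized sequence.
    """
    tokens = list(sequence)
    vocabulary = sorted(set(tokens))
    for _ in range(cycles):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        word = pair[0] + pair[1]
        if word not in vocabulary:
            vocabulary.append(word)
    return vocabulary, tokens


if __name__ == "__main__":
    vocab, toks = build_vocabulary("ATATATGC", cycles=2)
    print(vocab)
    print(toks)
```

Each cycle adds one new ‘word’ to the dictionary, so the number of cycles controls how large the vocabulary becomes. In this sketch the tokens always concatenate back to the original sequence, so no information is lost by the fragmentation.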

GROVER promises to unlock the different layers of the genetic code. DNA holds key information on what makes us human, our disease predispositions, and our responses to treatments. Understanding the rules of DNA through a language model could help uncover the biological meaning hidden in the DNA, advancing both genomics and personalized medicine in the future.









