Yekaterina “Kate” Shulgina was a first-year student at the Graduate School of Arts and Sciences looking for a short computational biology project to fulfill a requirement of her systems biology program. She wondered how the genetic code, once thought to be universal, could evolve and change.
That was in 2016, and what began as a short-term project has given Shulgina a way to tackle this genetic puzzle. She describes it in a new paper, co-authored with Harvard scientist Sean Eddy and published in the journal eLife.
The paper describes a new computer program that can read the genome sequence of any organism and then determine its genetic code. The program, called Codetta, can help scientists learn more about how the genetic code evolves and correctly interpret the genetic code of newly sequenced organisms.
The genetic code is the set of rules that tells cells how to translate three-letter nucleotide sequences, called codons, into the amino acids that make up proteins, often referred to as the building blocks of life.
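To make the idea of a code concrete, the short Python sketch below builds the standard codon table and translates a DNA string three letters at a time. It is only an illustration: the function name `translate` and the example sequence are assumptions made here, not part of Codetta.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# "*" marks a stop codon.
BASES = "TCAG"
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}


def translate(dna, code=STANDARD_CODE):
    """Translate DNA codon by codon, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons containing unexpected letters are shown as "X".
        aa = code.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Illustrative sequence: ATG (M), GCC (A), CGA (R), TAA (stop).
print(translate("ATGGCCCGATAA"))  # MAR
```

Here CGA is read as arginine (R), as the standard code specifies. An organism with an alternative code would need a different table passed as `code`.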
Nearly every organism, from E. coli to humans, uses the same genetic code, which is why the code was once thought to be fixed. However, scientists have discovered a handful of outlier organisms that use alternative genetic codes, in which the set of rules is different.
This is where Codetta can help. The tool can identify more organisms that use these alternative genetic codes, shedding new light on how genetic codes can change in the first place.
Codetta has already screened the genome sequences of more than 250,000 bacteria and archaea, another group of single-celled organisms, for alternative genetic codes, and identified five that had not been seen before. In all five cases, the code for the amino acid arginine was reassigned to a different amino acid. It is believed to be the first time scientists have observed this swap in bacteria, and it could hint at evolutionary forces that drive changes to the genetic code.
According to the researchers, the study is the largest screen for alternative genetic codes to date: Codetta analyzed essentially every available bacterial and archaeal genome. The program’s name combines “codon,” a three-nucleotide unit of the genetic code, with the Rosetta Stone, a stone slab inscribed in three scripts.
Shulgina has spent the past five years developing the statistical theory behind Codetta, writing the program, testing it, and then analyzing the genomes. It works by reading an organism’s genome and then comparing it against a database of known proteins to infer a likely genetic code. It differs from similar tools in that it can analyze genomes at a much larger scale.
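The general idea of inferring a code from known proteins can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not Codetta’s actual statistical method, which the article does not detail. It assumes we already have stretches of coding DNA paired with the known protein sequence each one lines up with, and it uses a plain majority vote per codon. The function `infer_code`, the `min_votes` threshold, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def infer_code(alignments, min_votes=3):
    """Guess each codon's meaning by majority vote over aligned amino acids.

    alignments: iterable of (coding_dna, aligned_protein) pairs, where the
    i-th amino acid is assumed to line up with the i-th codon.
    Codons with too little or conflicting evidence are reported as "?".
    """
    votes = defaultdict(Counter)
    for dna, protein in alignments:
        dna = dna.upper()
        for i, aa in enumerate(protein):
            codon = dna[3 * i:3 * i + 3]
            if len(codon) == 3:
                votes[codon][aa] += 1

    inferred = {}
    for codon, counter in votes.items():
        aa, count = counter.most_common(1)[0]
        total = sum(counter.values())
        confident = total >= min_votes and count / total > 0.5
        inferred[codon] = aa if confident else "?"
    return inferred


# Invented toy data, not results from the study: CGA consistently lines up
# with tryptophan (W) instead of the standard arginine (R).
toy_alignments = [
    ("ATGCGAGCC", "MWA"),
    ("ATGCGATTT", "MWF"),
    ("ATGCGAGCC", "MWA"),
]
print(infer_code(toy_alignments))
```

With this toy input, ATG and CGA each collect three consistent votes and are assigned M and W, while GCC and TTT fall below the threshold and stay undecided. Leaving a codon as "?" when the evidence is thin is a design choice that avoids overcalling a reassignment.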
Shulgina joined Eddy’s lab, which specializes in comparing genomes, in 2016 after seeking his advice on an algorithm she was designing to interpret genetic codes. Until now, no one had conducted such a broad survey for alternative genetic codes.
Eddy, a Howard Hughes Medical Institute investigator, said it was exciting to see new codes because, for all the team knew, Shulgina could have done all this work and found no new ones to uncover. He also noted that the method could be used to check the many databases that hold protein sequences.
Many protein sequences in databases today are only conceptual translations of genomic DNA sequences, according to Eddy. Researchers mine these protein sequences for all kinds of useful things, such as new enzymes or gene-editing tools. The protein sequences need to be correct, but if the organism uses a nonstandard code, they will be mistranslated.
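A small Python example shows how such a mistranslation can arise. It translates one invented DNA sequence under the standard code and under a long-known alternative, NCBI translation table 4 (used by Mycoplasma, among others), in which TGA encodes tryptophan instead of signaling a stop. The sequence and the helper `translate` are assumptions made for illustration. Table 4 is not one of the five codes reported in the study.

```python
from itertools import product

BASES = "TCAG"
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}
# NCBI translation table 4: identical to the standard code except TGA -> W.
TABLE_4 = dict(STANDARD, TGA="W")


def translate(dna, code):
    """Translate codon by codon until the first stop codon ("*")."""
    protein = ""
    for i in range(0, len(dna) - 2, 3):
        aa = code[dna[i:i + 3]]
        if aa == "*":
            break
        protein += aa
    return protein


gene = "ATGTGAGCCTAA"  # invented sequence for illustration

print(translate(gene, STANDARD))  # M    (TGA read as stop: truncated)
print(translate(gene, TABLE_4))   # MWA  (TGA read as tryptophan)
```

Translated with the wrong table, the protein is cut off after a single amino acid, which is the kind of database error Eddy describes.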
The researchers said the next stage of the work is to use Codetta to search for alternative codes in viruses, eukaryotes, and organellar genomes such as those of mitochondria and chloroplasts. There is still much of the diversity of life that has not been screened this systematically, according to Shulgina.