U of I researchers trace genetic code’s origins to early protein structures | College of Agricultural, Consumer & Environmental Sciences | Illinois
Just published in the Journal of Molecular Biology is a paper by three research scientists at the University of Illinois at Urbana-Champaign that will spread despondency among creationists—at least among those with the courage to read it and the understanding to grasp its significance. Another of their favourite god-shaped gaps has just been slammed shut.
The gap in question is the long-standing mystery of how the genetic code arose through natural processes, without the intervention of a supernatural intelligence. Creationists have long claimed that the genetic code is analogous to a computer program—something they assume must imply a programmer. They bolster this with the usual straw-man arguments and hand-waving about statistical impossibility, declaring that such complexity could not have arisen “by chance alone.”
Of course, that was never more than the familiar argument from ignorant incredulity coupled with the false dichotomy fallacy: because we don’t yet know something, it must have been their particular god. Not any of the other gods, of course—because those aren’t real.
What we do know is that the earliest forms of life appeared on Earth about 3.8 billion years ago, while the genetic code itself did not appear until some 800 million (0.8 billion) years later. Time, therefore, was not a limiting factor: there was no plan, no deadline, and no external programmer. That fact alone should give creationists cause for concern because any decent intelligent designer, especially an omniscient one, would not have taken 0.8 billion years to invent the genetic code.
Now, this team of researchers has produced a plausible explanation (and it only needs to be plausible to refute the claim that no explanation is possible). Their study is based on an analysis of 4.3 billion dipeptide sequences across 1,561 proteomes, representing organisms from all three domains of life: Archaea, Bacteria, and Eukarya. (Proteomes are the complete sets of proteins expressed in an organism.)
To rub salt into creationist wounds, the evidence points to the genetic code having emerged through an evolutionary “bootstrapping” process, in which improvements in the code itself led to improvements in the proteins that controlled the very process of coding — an elegant feedback loop with no need for divine intervention.
The research and its broader significance are explained in a University of Illinois Urbana-Champaign news release by Marianne Stein.
How Do We Know When the Genetic Code Arose?
Scientists don’t have fossils of the first genetic codes, so they use two main lines of evidence:
- Geological evidence
  - Rocks from Greenland and Australia show signs of life as early as 3.8–3.7 billion years ago (chemical “fingerprints” of biological carbon and stromatolite structures).
  - This marks the likely origin of life.
- Molecular evidence
  - By comparing the genes and proteins shared across all living things, scientists can reconstruct the features of the Last Universal Common Ancestor (LUCA).
  - Molecular clock analyses (which estimate how fast genetic changes accumulate) suggest LUCA lived about 3.0 billion years ago.
  - LUCA already had a fully developed genetic code and translation machinery.
Putting it together
- If life began ~3.8 billion years ago, but LUCA appeared ~3.0 billion years ago, the genetic code must have evolved during the ~0.8 billion years in between.
- This long span allowed primitive coding systems to gradually refine into the near-universal triplet code we see today.
What is the Genetic Code?
The genetic code is the set of rules by which information stored in DNA is translated into proteins, the working molecules of life. DNA is made up of four bases (A, T, C, G).
The Triplet Code
- Bases are read in groups of three called codons.
- Each codon specifies one amino acid (or a “stop” signal).
- With four bases read three at a time, there are 4 × 4 × 4 = 64 possible codons, enough to code for all 20 amino acids plus start/stop signals.
- The code is redundant: several codons can specify the same amino acid, helping reduce the impact of mutations.
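The arithmetic behind the triplet code can be checked in a few lines of Python. This is only an illustrative sketch: the names `BASES`, `AMINO_ACIDS` and `codon_table` are my own, and the 64-letter string is the standard codon table written out in T, C, A, G order, with `*` marking a stop codon.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"  # DNA bases, in the conventional order of the standard codon table
# One-letter amino acid codes for the 64 codons in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Every possible triplet of bases: TTT, TTC, TTA, ... GGG.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
codon_table = dict(zip(codons, AMINO_ACIDS))

# How many codons map to each amino acid (or to the stop signal)?
degeneracy = Counter(codon_table.values())

print(len(codons))          # 64 (that is, 4**3)
print(len(degeneracy) - 1)  # 20 distinct amino acids, not counting "*"
print(degeneracy["L"], degeneracy["M"], degeneracy["*"])  # 6 1 3
```

Leucine is served by six codons while methionine has only one, which is the redundancy described above.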
From DNA to Protein
- Transcription: DNA is copied into messenger RNA (mRNA).
- Translation: The mRNA is read by the ribosome, a molecular machine.
- tRNA (transfer RNA): tRNAs carry amino acids and have anticodons that pair with mRNA codons.
- Protein synthesis: The ribosome joins the amino acids in sequence, producing a chain that folds into a functional protein.
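The steps above can be mimicked with a toy Python sketch. It is a deliberate simplification, not a model of the real machinery: the ribosome and tRNAs are reduced to a dictionary lookup, the input is assumed to be the DNA coding strand (so transcription is just T replaced by U), and the function names `transcribe` and `translate` and the example sequence are invented for illustration.

```python
from itertools import product

RNA_BASES = "UCAG"
# Standard codon table in UCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip(("".join(t) for t in product(RNA_BASES, repeat=3)), AMINO_ACIDS)
)


def transcribe(coding_dna: str) -> str:
    """Return the mRNA that matches a DNA coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon; return one-letter amino acids."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: release the chain
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCTCTGTAA")  # made-up 12-base example
print(mrna)             # AUGGCUCUGUAA
print(translate(mrna))  # MAL
```

The three-residue product, methionine-alanine-leucine, contains the two dipeptides MA and AL.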
Why It Matters
This universal triplet code is found in all known life, pointing to a common origin. Its structure and redundancies show signs of evolution, not design.
U of I researchers trace genetic code’s origins to early protein structures
Genes are the building blocks of life, and the genetic code provides the instructions for the complex processes that make organisms function. But how and why did it come to be the way it is? A recent study from the University of Illinois Urbana-Champaign sheds new light on the origin and evolution of the genetic code, providing valuable insights for genetic engineering and bioinformatics.
We find the origin of the genetic code mysteriously linked to the dipeptide composition of a proteome, the collective of proteins in an organism.
Professor Gustavo Caetano-Anollés, corresponding author
Department of Crop Sciences
Carl R. Woese Institute for Genomic Biology, and Biomedical and Translational Sciences
Carle Illinois College of Medicine
University of Illinois Urbana-Champaign, Urbana, IL, USA.
Caetano-Anollés’ work focuses on phylogenomics, which is the study of evolutionary relationships between the genomes of organisms. His research team previously built phylogenetic trees mapping the evolutionary timelines of protein domains (structural units in proteins) and transfer RNA (tRNA), an RNA molecule that delivers amino acids to the ribosome during protein synthesis. In this study, they explored the evolution of dipeptide sequences (basic modules of two amino acids linked by a peptide bond), finding the histories of domains, tRNA, and dipeptides all match.
Life on Earth began 3.8 billion years ago, but genes and the genetic code did not emerge until 800,000 [sic]* million years later, and there are competing theories about how it happened.
Some scientists believe RNA-based enzymatic activity came first, while others suggest proteins first started working together. The research of Caetano-Anollés and his colleagues over the past decades supports the latter view, showing that ribosomal proteins and tRNA interactions appeared later in the evolutionary timeline.
Life runs on two codes that work hand in hand, Caetano-Anollés explained. The genetic code stores instructions in nucleic acids (DNA and RNA), while the protein code tells enzymes and other molecules how to keep cells alive and running. Bridging the two is the ribosome, the cell’s protein factory, which assembles amino acids carried by tRNA molecules into proteins. The enzymes that load the amino acids onto the tRNAs are called aminoacyl tRNA synthetases. These synthetase enzymes serve as guardians of the genetic code, monitoring that everything works properly.
Why does life rely on two languages – one for genes and one for proteins? We still don’t know why this dual system exists or what drives the connection between the two. The drivers couldn’t be in RNA, which is functionally clumsy. Proteins, on the other hand, are experts in operating the sophisticated molecular machinery of the cell.
Professor Gustavo Caetano-Anollés.
The proteome appeared to be a better fit to hold the early history of the genetic code, with dipeptides playing a particularly significant role as early structural modules of proteins. There are 400 possible dipeptide combinations whose abundances vary across different organisms.
The research team analyzed a dataset of 4.3 billion dipeptide sequences across 1,561 proteomes representing organisms from the three superkingdoms of life: Archaea, Bacteria, and Eukarya. They used the information to construct a phylogenetic tree and a chronology of dipeptide evolution. They also mapped the dipeptides to a tree of protein structural domains to see if similar patterns arose.
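To make the idea of dipeptide abundance concrete, here is a minimal Python sketch of how dipeptides might be counted across a set of protein sequences. It is not the authors' pipeline, which goes on to build phylogenetic trees from such data. The function `dipeptide_counts` and the two toy sequences are hypothetical, invented purely for illustration.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes
ALL_DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_counts(proteome):
    """Count overlapping dipeptides (adjacent residue pairs) across protein sequences."""
    counts = Counter()
    for protein in proteome:
        # A chain of n residues contains n - 1 peptide bonds, hence n - 1 dipeptides.
        counts.update(protein[i:i + 2] for i in range(len(protein) - 1))
    return counts


# Hypothetical toy "proteome": two short made-up sequences, not real proteins.
toy_proteome = ["MALAL", "MKVLA"]
counts = dipeptide_counts(toy_proteome)

print(len(ALL_DIPEPTIDES))         # 400 (that is, 20 * 20)
print(sum(counts.values()))        # 8 dipeptides in total
print(counts["AL"], counts["LA"])  # 2 2
```

Scaled up from two toy sequences to 1,561 real proteomes, the same sliding-window count is what yields billions of dipeptide observations.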
In previous work, the researchers had built a phylogeny of tRNA that helped provide a timeline of the entry of amino acids into the genetic code, categorizing amino acids into three groups based on when they appeared. The oldest were Group 1, which included tyrosine, serine, and leucine, and Group 2, with 8 additional amino acids. These two groups were associated with the origin of editing in synthetase enzymes, which corrected inaccurate loading of amino acids, and an early operational code, which established the first rules of specificity, ensuring each codon corresponds to a single amino acid. Group 3 included amino acids that came later and were linked to derived functions related to the standard genetic code.
The team had already demonstrated the co-evolution of synthetases and tRNA in relation to the appearance of amino acids. Now, they could add dipeptides to the analysis.
We found the results were congruent. Congruence is a key concept in phylogenetic analysis. It means that a statement of evolution obtained with one type of data is confirmed by another. In this case, we examined three sources of information: protein domains, tRNAs, and dipeptide sequences. All three reveal the same progression of amino acids being added to the genetic code in a specific order.
Professor Gustavo Caetano-Anollés.
Another novel finding was duality in the appearance of dipeptide pairs. Each dipeptide combines two amino acids, for example, alanine-leucine (AL), while a symmetrical one — an anti-dipeptide — has the opposite combination of leucine-alanine (LA). The two dipeptides in a pair are complementary; they can be considered mirror images of each other.
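The pairing described here is easy to enumerate. The short Python sketch below, with the invented helper name `anti_dipeptide`, simply reverses each dipeptide to find its partner. It illustrates the definition only and says nothing about when the pairs appeared in evolution.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes


def anti_dipeptide(dipeptide: str) -> str:
    """Return the dipeptide with the two residues in the opposite order."""
    return dipeptide[::-1]


dipeptides = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

# Dipeptides such as AA or LL are their own anti-dipeptide.
self_symmetric = [d for d in dipeptides if d == anti_dipeptide(d)]

# Each remaining dipeptide forms an unordered pair with its reverse, e.g. (AL, LA).
pairs = {
    tuple(sorted((d, anti_dipeptide(d))))
    for d in dipeptides
    if d != anti_dipeptide(d)
}

print(anti_dipeptide("AL"))  # LA
print(len(self_symmetric))   # 20
print(len(pairs))            # 190
```

So the 400 dipeptides divide into 190 dipeptide/anti-dipeptide pairs plus 20 that are their own mirror image (20 + 2 × 190 = 400).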
We found something remarkable in the phylogenetic tree. Most dipeptide and anti-dipeptide pairs appeared very close to each other on the evolutionary timeline. This synchronicity was unanticipated. The duality reveals something fundamental about the genetic code with potentially transformative implications for biology. It suggests dipeptides were arising encoded in complementary strands of nucleic acid genomes, likely minimalistic tRNAs that interacted with primordial synthetase enzymes.
Professor Gustavo Caetano-Anollés.
Dipeptides did not arise as arbitrary combinations but as critical structural elements that shaped protein folding and function. The study suggests that dipeptides represent a primordial protein code emerging in response to the structural demands of early proteins, alongside an early RNA-based operational code. This process was shaped by co-evolution, molecular editing, catalysis, and specificity, ultimately giving rise to the synthetase enzymes, the modern guardians of the genetic code.
Uncovering the evolutionary roots of the genetic code deepens our understanding of life’s origin, and it informs modern fields such as genetic engineering, synthetic biology, and biomedical research.
Synthetic biology is recognizing the value of an evolutionary perspective. It strengthens genetic engineering by letting nature guide the design. Understanding the antiquity of biological components and processes is important because it highlights their resilience and resistance to change. To make meaningful modifications, it is essential to understand the constraints and underlying logic of the genetic code.
Professor Gustavo Caetano-Anollés.
* Note: This 800,000 million (800 billion) years is clearly in error. It should be 800 million.
As so often, what creationists have claimed as evidence of their god turns out, once the gap is explored, to be nothing more than a reflection of our temporary lack of knowledge. Science, through patient investigation and evidence-based reasoning, has once again advanced into territory once thought “unknowable,” showing that no supernatural explanation is required.
Publication:
Tracing the Origin of the Genetic Code and Thermostability to Dipeptide Sequences in Proteomes
Minglei Wang, M. Fayez Aziz, Gustavo Caetano-Anollés
Highlights
- Billions of dipeptide sequences in 1,561 proteomes offer insight into code emergence.
- An evolutionary chronology of dipeptides supports an early operational RNA code.
- Genetic code entry was congruent with tRNA and synthetase coevolutionary history.
- Synchronous dipeptide-antidipeptide appearance uncovered an ancestral genetic duality.
- The timeline revealed protein thermostability was a late evolutionary development.
Abstract
The safekeeping of the genetic code has been entrusted to interactions between aminoacyl-tRNA synthetases and their cognate tRNA. In a previous phylogenomic study, chronologies of RNA substructures, protein domains and dipeptide sequences uncovered the early emergence of an ‘operational’ code in the acceptor arm of tRNA prior to the implementation of the 'standard' genetic code in the anticodon loop of the molecule. This history likely originated in peptide–synthesizing urzymes but was driven by episodes of molecular co-evolution and recruitment that promoted flexibility and protein folding. Here, we show that dipeptide sequences offer deep-time insights into the chronology of code emergence. A phylogeny describing the evolution of the repertoire of 400 canonical dipeptides reconstructed from an analysis of 4.3 billion dipeptide sequences across 1,561 proteomes revealed the overlapping temporal emergence of dipeptides containing Leu, Ser and Tyr, followed by those containing Val, Ile, Met, Lys, Pro, and Ala, all of which supported the operational RNA code. This strengthened a timeline of genetic code entry. The synchronous appearance of dipeptide–antidipeptide sequences along the dipeptide chronology supported an ancestral duality of bidirectional coding operating at the proteome level. Tracing determinants of thermal adaptation showed protein thermostability was a late evolutionary development and bolstered an origin of proteins in the mild environments typical of the Archaean eon. Our study uncovers a hidden evolutionary link between a protein code of dipeptides – arising from the structural demands of emerging proteins – and an early operational code shaped by co-evolution, editing, catalysis and specificity.
Introduction
Peptides and polypeptides are generally linear heteropolymer chains of amino acids covalently linked together by peptide bonds. Peptides contain fewer than 50 amino acid residues. They make up bioactive compounds, including neurotransmitters, hormones, and antimicrobials. Ribosomally synthesized and post-translationally modified peptides (RiPPs) for example represent major classes of natural products present in all superkingdoms of life [1]. Similarly, large families of non-ribosomal peptides are synthesized by non-ribosomal peptide synthases (NRPSs) from more than 500 different amino acid and fatty acid monomers [2]. They produce a wide array of bioactive compounds. Even synthetic fragments of comparable length from known enzymes (sometimes encoded by complementary strands of a same gene) exhibit very substantial catalysis [[3], [4], [5], [6], [7]]. Their activities suggest an early participation of peptides in events leading to genetic encoding. Polypeptides, on the other hand, are larger and more structurally complex. Proteins for example are polypeptides 5–500 kDa in mass that make up nanoparticles 2–10 nm in diameter [8]. These macromolecules consist of one or more chains 11-to-34,350 residues long synthesized by the ribosomal translation machinery from a repertoire of 22 genetically encoded (proteinogenic) L-α-amino acids, including the 20 standard amino acids as well as selenocysteine and pyrrolysine, both of which are incorporated by specialized translation mechanisms [8]. Proteins fold cooperatively into globular, fibrous or membrane forms exhibiting compact and stable atomic three-dimensional (3D) conformations [10], although some regions or complete chains may remain intrinsically disordered [11]. Remarkably, even short peptides adopt residue- and size-dependent conformations that deviate from random coils [12]. 
Some are highly structured and can readily bind to substrates such as dNTPs and duplex DNA suggesting they can harbor functional substrate-binding sites [3]. A frustrated landscape of structure and disorder therefore percolates the folding process and is expected to impact protein evolution [13].
The origin and evolution of peptides and proteins, while still not fully understood, have been increasingly clarified with evolutionary chronologies and time-dependent networks reconstructed with structural phylogenomic methodologies [10,14]. An initial focus has been on protein domains, which serve as the structural, functional and evolutionary building blocks of proteins. Another focus has been on protein loops, which are the elemental architects of protein structure. Chronologies describe the ‘time of origin’ (age) of individual domain and loop structures [15,16]. Evolving networks dissected recruitment processes responsible for the birth of domains [17,18] and domain organization [19]. Here, we extend our phylogenomic study of the protein world to the most elemental building block of protein chemistry – the peptide bond. Each bond defines a 2-mer component of the polypeptide chain. For simplicity, we refer to these pairs of sequential residues as dipeptides, acknowledging that each dipeptide represents one of ∼400 possible canonical peptide bond ‘types’ that constitute protein sequence and structure.
More than a decade ago, Caetano-Anollés and colleagues used chronologies of dipeptides, domains and tRNA structures to study the emergence of amino acid charging and codon specificities [20]. A timeline of genetic code expansion uncovered the co-evolution of aminoacyl-tRNA synthetases (aaRSs) and tRNA structures, as well as the early appearance of an ‘operational’ RNA code driven by editing specificities. This primordial code, intimated long ago by the discovery of determinants of specificity in the acceptor arm of tRNA [21,22], evolved into modern genetics through episodes of recruitment. Remarkably, the amino acid and dipeptide compositions of single-domain proteins that appeared before the standard code suggested that genetics emerged through co-evolutionary interactions between polypeptides and nucleic acid cofactors favoring protein flexibility and folding [20]. Furthermore, phylogenomic data provided support to the hypothesis that primordial polypeptides were assembled through ligation in a biosynthetic cycle facilitated by archaic aaRSs that were homologous in structure to the catalytic domains of tyrosyl-tRNA and seryl-tRNA synthetases. Consequently, studying the evolution of dipeptides may reflect an early evolutionary accretion process in which dipeptides gradually formed structured polypeptides. We now integrate these initial studies with direct phylogenomic reconstructions of the history of dipeptides in the protein world. We reveal significant phylogenetic signatures in dipeptide composition, enabling the reconstruction of a tree of dipeptide sequences (ToDS) and a dipeptide chronology that confirms our inferences about the origin and evolution of the genetic code. Additionally, we show that domains exhibit dipeptide compositions that are biased and follow a biphasic evolutionary pattern typical of processes of module creation. These biases shaped structural innovation in evolution of protein domain makeup. 
Finally, we trace features of protein thermostability along dipeptide history and reveal thermal adaptation was a late evolutionary deployment. The central hypothesis driving these studies is that a catalytic code embedded in promiscuous primordial enzymes (urzymes) was gradually replaced by an emergent genetic code in a transition that fostered evolutionary refinement of protein–RNA interaction specificities, increased protein flexibility, and the development of novel and enhanced molecular functions.
Wang, Minglei; Aziz, M. Fayez; Caetano-Anollés, Gustavo
Tracing the Origin of the Genetic Code and Thermostability to Dipeptide Sequences in Proteomes
Journal of Molecular Biology (2025) 169396 DOI: 10.1016/j.jmb.2025.169396.
Copyright: © 2025 The authors.
Published by Elsevier B.V. Open access.
Reprinted under a Creative Commons Attribution 4.0 International license (CC BY 4.0)
The genetic code, far from being the inscrutable signature of a cosmic programmer, now looks increasingly like the product of natural evolutionary processes—a system that bootstrapped itself step by step over hundreds of millions of years, refining as it went. That is precisely what we would expect from evolution, and the opposite of what creationists predict.
Another god-shaped gap has closed. And with every one that does, the space left for creationist superstition shrinks further, while the evidence for a natural, evolutionary origin of life continues to grow.