Pandemic coronavirus mutations are more than replacing a letter in the viral genome
Deletions and insertions are also of evolutionary interest.
Sketch of the SARS-CoV-2 genome structure.
The long genome of the pandemic virus SARS-CoV-2 (SC2) contains nearly 30,000 letters. Each letter represents a nucleotide. There are two distinct parts in the genome: one region that encodes 16 non-structural proteins (nsp 1–16) and a second that encodes four structural proteins. The latter make up the anatomy of the virus. For the structural proteins, remember S, E, M, and N: the first letters of Spike, Envelope, Membrane, and Nucleocapsid.
The S gene encodes the spike (glyco)protein. It has two subunits (S1 and S2) containing 16 domains (a domain is a genomic area, site, or fragment). The Receptor-Binding Domain (RBD) and the N-Terminal Domain (NTD) are the ones of interest here. The coronavirus uses the S protein to bind to the ACE2 receptor on the cell membrane. This binding is the mechanism that initiates cell invasion.
The RBD is in the S1 subunit. It is a structure of only 273 amino acids. Nevertheless, the RBD of SC2 binds ACE2 potently: 10–20-fold more strongly than that of other coronaviruses (e.g., SARS-CoV). SARS-CoV is the coronavirus that caused the 2002–2003 epidemic involving bats, civets, and humans.

Protein S is a primary target of the antibody-mediated immune response. This applies to binding and neutralization by natural antibodies as well as by vaccine-induced and therapeutic ones. It is therefore easy to understand why the RBD of the S protein is a preferred site for the appearance of many mutations. Many viral variants have surged in nature: from D614G and alpha, through several others, to the most recent, omicron. In short, the RBD offers a great chance for mutations. But one should not forget its neighbor, the NTD.
Mutations do not mean only nucleotide substitutions.
A mutation is a genetic change. It is an error that occurs during virus replication. Mutations in RNA viruses are a biological rule. They happen by chance and can be beneficial, detrimental, or indifferent to the virus. Likewise, the viral mutational change can be helpful or harmful to the host (here, the human being).
It is easy to assume the following: what benefits the fitness of the virus will harm the human. On the virus side, there can be greater infectivity, virulence, or capacity to evade immunity. For humans, this means more contagious, severe, or lethal infections, and less effective vaccines and antibody treatments.
It happens because some changes in the viral genome allow the virus to adapt better (by Darwinian selection) and to keep replicating and spreading more effectively. Good proof of this is the series of mutant lineages or variants (D614G, alpha, delta, and omicron) that have swept the planet in successive waves.
A mutation is detrimental to the virus when it prevents the virus from perpetuating itself. In other words, the change blocks or cuts off viral replication. Finally, mutations that neither improve nor worsen virus survival are indifferent. They are neutral mutations. Neutral mutations make up most natural replicative changes.
Mechanisms of mutation production.
The mechanisms by which mutations occur are threefold: substitution, deletion, and insertion.
- Substitution involves the change of one or more letters (nucleotides) of the viral genome and, possibly, of the encoded amino acids. Every three nucleotides (a codon) direct the synthesis of one amino acid. A substitution is synonymous when the new codon encodes the same amino acid, so the viral protein sequence does not change. It is non-synonymous if the change of the codon (the triplet of nucleotides) alters the amino acid and may affect the protein (its structure and function).
- Deletion is the loss of one or more nucleotides during viral replication. When a whole codon is lost, the protein loses one amino acid. When the number of lost nucleotides is not a multiple of three, the result is a so-called frameshift mutation: all subsequent codons in the reading frame also change.
- Insertion involves adding one or more letters into the original genetic sequence. Nothing is replaced or lost. New letters appear in the viral genome and, therefore, additional or different amino acids appear in the synthesized protein.
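The three mechanisms can be illustrated with a minimal Python sketch. It assumes the standard genetic code and uses a short toy sequence invented for this example (it is not taken from the SC2 genome); the names `translate`, `original`, and the variables for each mutated sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate complete codons only; a trailing partial codon is ignored."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


# Hypothetical toy sequence: ATG GAT CGG TTA CCA GGC.
original = "ATGGATCGGTTACCAGGC"

# Substitutions keep the length; at most one codon changes.
synonymous = original[:8] + "T" + original[9:]        # CGG -> CGT, both arginine
non_synonymous = original[:4] + "G" + original[5:]    # GAT -> GGT, D -> G

# Losing a whole codon removes one amino acid and keeps the reading frame.
in_frame_deletion = original[:3] + original[6:]
# Losing a single nucleotide shifts every codon downstream (frameshift).
frameshift_deletion = original[:4] + original[5:]

# Adding a whole codon adds one amino acid.
insertion = original[:6] + "GCA" + original[6:]

for label, seq in [
    ("original", original),
    ("synonymous substitution", synonymous),
    ("non-synonymous substitution", non_synonymous),
    ("in-frame deletion", in_frame_deletion),
    ("frameshift deletion", frameshift_deletion),
    ("insertion", insertion),
]:
    print(f"{label:28s} {translate(seq)}")
```

In this toy case, the synonymous substitution leaves the protein untouched, while the one-letter deletion changes every amino acid after the first.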
Quantity and quality of mutations.
In the pandemic coronavirus SARS-CoV-2 (SC2), there are substitutions, deletions, and insertions (mutations).
A polymerase enzyme governs viral replication, where these changes arise. About 25 mutations occur each year in SC2. It means an average of about two changes per month. As of April 2021, there were 1.57 million genomes sequenced in GISAID (an international genome repository). They came from 187 countries and territories and revealed 857 mutations with 100 sequences. Of the total changes, 816 (95.2%) were substitutions, and 37 (4.3%) were deletions. Insertions were very rare (0.4%).
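The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch that reuses the numbers quoted above; the variable names are chosen for illustration.

```python
# Figures quoted in the text.
mutations_per_year = 25
total_mutations = 857
substitutions = 816
deletions = 37

per_month = mutations_per_year / 12
substitution_share = 100 * substitutions / total_mutations
deletion_share = 100 * deletions / total_mutations

print(f"{per_month:.1f} changes per month")           # 2.1 changes per month
print(f"substitutions: {substitution_share:.1f}%")    # substitutions: 95.2%
print(f"deletions: {deletion_share:.1f}%")            # deletions: 4.3%
```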
Substitutions far outnumber deletions and insertions. So, the quantity matters, but the quality must also be important. The omicron variant has made this clear. It exhibits many mutations (more than 50 throughout the whole viral genome). Most of them (35 mutations, 32 of them non-synonymous) are located in the S gene. Above all, the RBD has ten mutations; some are new, but others (e.g., K417N, T478K, N501Y, and E484A) were already known to researchers because they existed in some variants of concern (VOCs).
Besides their striking quantity, it is also interesting to consider the type or class of the omicron mutations: their quality.
The S gene is an antibody target and a mutation seeding site.
The accumulation of mutations in the S gene is the rule. In general, most modifications are of evolutionary significance for SC2. But they are also of epidemiological and clinical importance for humans. In early 2020, scientists detected the first relevant mutation in the spike, in the S1 subunit encoded by the S gene. It was D614G. It increased the contagiousness of SC2 by increasing the interaction of the RBD with the human ACE2 membrane receptor.
By increasing the viral load in the upper respiratory tract, D614G enhances viral transmission. For this reason, D614G soon became the predominant mutated coronavirus worldwide. It happened in a few weeks. Since then, it has been present in all variants.
But this was only the beginning of the mutational dance. Throughout 2020 and in early 2021, other significant mutations appeared. All of them have been modifying the fitness or capacity of the coronavirus. This aptitude made it more contagious (increased reproduction rates), aggressive (higher viral load), and able to evade the human immune system and withstand some vaccines.
Substitutions define each of the VOCs. The primary individual changes are N501Y in the alpha variant, E484K and K417N/T in beta and gamma, L452R in delta, and the many changes of omicron.
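Labels such as D614G or N501Y follow a common convention: the one-letter code of the original amino acid, its position in the protein, and the one-letter code of the new amino acid. A small Python sketch can unpack them. The names `Substitution`, `parse_substitution`, and `describe` are hypothetical helpers written for this example, and the name table covers only the amino acids used in the sample labels.

```python
import re
from typing import NamedTuple


class Substitution(NamedTuple):
    reference: str   # original amino acid (one-letter code)
    position: int    # position in the protein
    variant: str     # new amino acid (one-letter code)


SUBSTITUTION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")

# Only the amino acids needed for the examples below.
AMINO_ACID_NAMES = {
    "D": "aspartic acid",
    "G": "glycine",
    "L": "leucine",
    "N": "asparagine",
    "R": "arginine",
    "Y": "tyrosine",
}


def parse_substitution(label: str) -> Substitution:
    """Split a label such as 'D614G' into its three parts."""
    match = SUBSTITUTION_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a simple substitution label: {label!r}")
    reference, position, variant = match.groups()
    return Substitution(reference, int(position), variant)


def describe(label: str) -> str:
    """Spell out a substitution label in words."""
    sub = parse_substitution(label)
    return (
        f"{AMINO_ACID_NAMES[sub.reference]} at position {sub.position} "
        f"replaced by {AMINO_ACID_NAMES[sub.variant]}"
    )


for label in ["D614G", "N501Y", "L452R"]:
    print(label, "=", describe(label))
```

Deletion and insertion labels (for example H69/V70del) use a different pattern, so this sketch rejects them on purpose.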
Deletions have evolutionary and pathogenic significance.
Studying the role of deletions is very interesting. If they occur in a specific epitope region, they can change the immune response (immune evasion). An epitope is a particular area on the surface of an antigen that interacts with the specific antibodies to which it binds. As a rule, an antigen has several different epitopes to bind.
Four of the five variants of concern (alpha, beta, delta, and omicron) also have deletions, or losses of nucleotides, besides many substitutions (well beyond those cited).
Deletions usually occur in the N-terminal domain of the spike. It contains deletion-prone regions that allow the virus to escape antibody neutralization. They are called Recurrent Deletion Regions (RDRs). Four RDRs were described at first, but more have been found since.

Let us look at the deletions of interest and at antigenic minimalism.
An article reported that the prevalence of NTD deletions increased as cases increased worldwide. The effect was notable in India and Chile. The article also found NTD deletions in samples from patients previously infected with Covid-19 (reinfections) and from subjects who were already vaccinated (with the Pfizer and Johnson & Johnson vaccines).
In India, one NTD deletion increased more than 13-fold from February to April 2021, making it the fifth recurrent deletion region (RDR). In Chile, over the same period, a deletion associated with a substitution in the NTD increased 38.4-fold in prevalence. It was an outbreak independent of the two predominant variants (alpha and gamma). This deletion was present in one of the two main variants circulating in the country (C.37 lineage).
It is striking that the supersite where the deletions occur is a target of neutralizing antibodies. For the authors of the cited study, these deletions represent a historical genetic fingerprint and also the future trajectory of the antigenic “minimalism strategy” employed to evade immunity.
It is worth noting that two deletions were also found in an immunosuppressed patient 67 and 72 days after the diagnosis of SC2 infection. Immunosuppressed subjects infected with SC2 can be veritable mutation factories. Like substitutions, deletions matter too.
There are deletions in four of the five variants of concern (VOCs) and in three variants of interest (VOIs):
The alpha variant has three deleted residues in the NTD: H69/V70del and Y144del. The latter allows the coronavirus to evade antibodies directed against this area. Moreover, the most exposed part of the NTD on the virion surface is a hot spot for spike mutations. Preliminary evidence shows that the H69/V70del deletion could improve viral fitness.
The beta variant also has three deleted residues in the NTD (241/242/243del). The gamma variant has no deletions in the NTD.
The delta variant, which until now dominated the global epidemiological landscape, carries the 156del deletion.
The omicron variant has several NTD deletions: 69/70del, V143del, Y144del, Y145del, and N211del.
Among the six accepted VOIs (epsilon, zeta, eta, iota, kappa, and lambda), three carry deletions. Lambda has the 247/253del deletion. The eta variant (B.1.525), described in the United Kingdom and Nigeria, has the H69/V70del and Y144/145del deletions in the NTD. The iota variant (B.1.526), detected in New York, has only the Y144del deletion. Recall that the H69/V70del deletion also exists in the alpha and omicron variants of concern.
Specific substitutions occur as if there were a concerted viral evolutionary strategy.
But there are not only deletions or nucleotide losses in the NTD. There may also be gene insertions.
Insertions are a new issue of concern.
Despite their evolutionary importance, insertions are the neglected family of genomic changes in the pandemic coronavirus. So far, they have received less attention from researchers and had almost no presence in the media and social networks, except among a select group of excellent researchers who are very active on Twitter. That is due to their rarity (0.4% of mutations) and to the difficulty of interpreting their significance.
Their mechanism is known, but not their origin. Yet insertions have evolutionary relevance. They are less random than other mutations and need a source RNA, of viral or animal nature. Although they can span between one and nine codons (encoding 1–9 amino acids), 99% have three codons. Each codon consists of three nucleotides and encodes one amino acid.
The significance of this type of mutation is unknown. One example is a single 12-nucleotide insertion, in which we highlight the two central codons or triplets. This sequence (CCT-CGG-CGG-GCA) encodes four amino acids: PRRA, or proline-arginine-arginine-alanine. It occurred in the S gene and allowed the evolutionary acquisition of the furin cleavage site.
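The translation of this 12-nucleotide insertion can be reproduced with a short Python sketch based on the standard genetic code. The flanking sequences `upstream` and `downstream` are hypothetical and serve only to show that an insertion whose length is a multiple of three adds amino acids without shifting the reading frame.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate a sequence whose length is a multiple of three."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


# The insertion quoted in the text: CCT-CGG-CGG-GCA.
INSERT = "CCTCGGCGGGCA"

# Hypothetical flanking codons, chosen only for illustration.
upstream = "AATTCT"
downstream = "CGTAGT"

before = upstream + downstream
after = upstream + INSERT + downstream

print(translate(INSERT))   # PRRA
print(translate(before))   # NSRS
print(translate(after))    # NSPRRARS
```

Because 12 is a multiple of three, the amino acids on both sides of the insertion stay the same.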
The furin cleavage site was an evolutionary advance of the virus.
The furin cleavage site is paramount for virus entry into the host cell. It prepares the spike for fusion, the process that follows binding to the membrane ACE2 receptor. The central nucleotide triplets of the sequence, CGG-CGG (C for cytosine and G for guanine), each encode the amino acid arginine, represented by the letter R. Here, the RR (arginine-arginine) doublet is unique to SC2.
This important cleavage site is an evolutionary determinant of the pathogenicity of the coronavirus. It is, so to speak, the flame that ignited the human pandemic. This insertion is absent or rare in bat coronaviruses, including RaTG13, the closest zoonotic virus to SC2.
Researchers from two Italian universities (Trieste and Verona) reported (December 9, 2021) the independent occurrence of 41 different insertion events at the same NTD site. The events were between Val213 and Leu214. By analogy with the RDRs of the deletions, they named the site Recurrent Insertion Region-1 (RIR-1).
Two lineages carrying RIR-1 insertions had a sizeable international spread, but without causing harmful effects. Insertions have been found in some immunocompromised subjects with prolonged infection. They have also been obtained in the laboratory by passage in Vero cell culture.
The presumed importance of insertions.
One can speculate that RIR-1 insertions are essential because they provide an evolutionary advantage and support convergent evolution. Convergent, in the evolutionary sense, means a positive selective pressure. It occurs when two variants have the same mutation with no connection between the cases. As a result, the mutated virus improves biological fitness.
It does not appear that the insertions participate in immune escape. However, their role in the T lymphocyte response is under scrutiny because of their impact on T epitopes. Finally, there seems to be a correlation between RIR-1 mutations, the occurrence of non-synonymous substitutions at the RBD site, and the specific RDR deletions discussed above.
The genomic origin of insertions.
What is their origin? This question is interesting. From a theoretical point of view, insertions can come from three sources:
- From the genomic material of the virus itself.
- From other different viruses.
- From the host (animal or human).
The incorporation of genetic material can proceed from distal regions of the same genome. More rarely, the genomic material would come from a co-infecting endemic coronavirus. Finally, the rarest and most far-reaching possibility is the provenance of the RNA from the human host.
Imperial College London published provocative data about the last option. The insertions in a region of the NTD domain and the furin S1/S2 cleavage site are access points for deletions.
As the possible origin of the insertions, the investigators found duplications of adjacent codons or codon pairs. One example is the ins678QT insertion, present in several lineages. Others (such as the 12-nucleotide insertion at the S1/S2 site in the so-called Russian AT.1 lineage) come from elsewhere in the genome. The AT.1 insertion shows high homology with the 3' UTR of the SC2 genome.
These genomic changes are the result of copy-choice recombination, a common mechanism during coronaviral RNA synthesis. It is also the template switching used to produce subgenomic RNAs (sgRNAs).
Finally, other insertions may originate from human host RNA. The authors compared the insertions with the host transcriptome using the appropriate method. The transcriptome is the set of transcripts or gene reads present in a cell. For example, a cluster of viruses in Virginia (USA) had a 12-nucleotide insertion in the N protein that has high homology with the human ZBTB20 transcript. The Mu variant (Colombia and Ecuador) has an insertion homologous to the human TRIM28 transcript.
The existence of other insertions of the host and viruses of the subgenus Embecovirus supports their hypothesis. It suggests that the changes can act as wildcard mutations that alter the phenotype of emerging viruses. This also occurs in flaviviruses (such as bovine pestiviruses) and in some influenza viruses (H7N3).
The hypothesis opens new questions among the many surrounding this extraordinary zoonotic virus. A similar genetic analysis was performed on the omicron variant. It found a high homology between the insertion (ins214EPE) of omicron and the human TMEM245 mRNA.
In support of the above, other investigators also found this insertion in the omicron variant. The authors studied 1523 viral lineages: 5.4 million genomes from 200 countries and territories (December 2019 to November 2021, GISAID archive). They compared them to the current VOCs and VOIs and detected 26 mutations in omicron (23 substitutions, two deletions, and one insertion), all different from those known in VOCs/VOIs.
The ins214EPE insertion was unknown in any other lineage. For the moment, it is specific to omicron. Its origin is uncertain.
One possibility is an origin in an endemic respiratory coronavirus. The best candidate is HCoV-229E, through co-infection in the same host. In its favor, an identical nucleotide sequence exists in HCoV-229E, and the ACE2 receptor and the ANPEP receptor are co-expressed in enterocytes and human respiratory ciliated cells. This fact suggests that the human host may act as a site of genetic interaction. The authors call it an evolutionary sandbox, recalling the porcine test tube phenomenon in the genesis of influenza pandemics. But the insertion may also come from other non-coronaviral respiratory viruses.
Or the origin may be human, because the researchers detected many fragments of the transcriptome harboring nucleotide sequences identical to ins214EPE. There are more than 750 fragments of the human genome with sequences like the one coding for ins214EPE. They include the mRNAs of SLCA7 and TMEM245.
Conclusion.
- Genomic substitutions (or letter exchanges) are a primary mechanism in coronavirus evolution. Many of these mutations are of epidemiological and clinical significance.
- Deletions occur mainly in the N-terminal domain. They favor evasion of neutralizing antibodies. Moreover, their detection seems to precede outbreaks of cases by around three months.
- The significance of insertions, the third class of mutations, is still not elucidated. They may act as permissive mutations that compensate for the infectivity deficits of other mutations.
The mutations issue is a whole world in continuous change.