Characterization of non-synonymous SNPs in the spike protein of SARS-CoV-2 variants from different world populations.
Abstract
Covid-19 is a respiratory disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and is responsible for more than 28,000,000 infections and more than 900,000 deaths worldwide. It was discovered in late 2019 and declared a pandemic in early 2020. The rapid mutation rate of the virus has resulted in different variants, some with higher infectivity than others. In this in silico analysis, 144 full coding-strand sequences of the spike glycoprotein were retrieved from 14 populations spanning 6 continents (Africa, Asia, Europe, Australia, North America and South America): Benin, Brazil, Egypt, Gabon, Great Britain, India, Italy, Kenya, Nigeria, New Zealand, Uganda, the United States of America and South Africa, together with the reference strain from Wuhan, China, where the virus was first identified. The sequences were analysed with the bioinformatics tools MEGA X, DnaSP, ClustVis and SWISS-MODEL, and with Microsoft Excel. The aligned, translated amino acid sequences were studied to identify the effects of the mutations and the SNP frequency in the spike gene, in order to discern their possible effect on the immunogenicity of the spike protein and their frequency in the various populations. Phylogenetic analysis of the spike gene was used to show the evolutionary hierarchy among the populations and to identify the variants possibly present in each, and codon usage was analysed to identify potential drug targets.
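As an illustration of how aligned coding sequences can be screened for non-synonymous changes, the following is a minimal Python sketch and not the workflow used in this study, which relied on the tools named above. It assumes that each sample is already aligned in frame with the reference, uses the standard genetic code, and skips any codon containing a gap or an ambiguous base, so insertions and deletions are not reported. The function name `classify_codon_changes` and the toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, built in TCAG order (assumption: no alternative code is needed).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_changes(reference, sample):
    """Compare two in-frame, aligned coding sequences codon by codon.

    Returns a list of (codon_position, ref_aa, sample_aa, kind) tuples,
    where codon_position is 1-based and kind is "synonymous" or
    "non-synonymous". Codons with gaps or ambiguous bases are skipped.
    """
    changes = []
    usable = min(len(reference), len(sample))
    usable -= usable % 3  # ignore a trailing partial codon
    for start in range(0, usable, 3):
        ref_codon = reference[start:start + 3].upper()
        alt_codon = sample[start:start + 3].upper()
        if ref_codon == alt_codon:
            continue
        if ref_codon not in CODON_TABLE or alt_codon not in CODON_TABLE:
            continue  # gap or ambiguous base: not classified here
        ref_aa = CODON_TABLE[ref_codon]
        alt_aa = CODON_TABLE[alt_codon]
        kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
        changes.append((start // 3 + 1, ref_aa, alt_aa, kind))
    return changes


if __name__ == "__main__":
    # Toy sequences (hypothetical): codon 2 changes Asp to Gly, codon 3 stays Leu.
    for change in classify_codon_changes("ATGGATCTT", "ATGGGTCTC"):
        print(change)
```

Applied to each of the retrieved sequences against the Wuhan reference, a routine of this kind would yield a per-sequence list of amino acid replacements that could then be tallied per population.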
Insertions, deletions and silent mutations were observed in the dataset, and mutations in the nucleotide sequence resulted in changes to the amino acid sequence. The variants identified in the populations included Alpha, Beta, Eta, Gamma, Delta, A.23.1 and 19A. Delta was the most widely distributed variant in the dataset. Amino acids with an RSCU score above 5 are potential drug targets. The frequency of each SNP in the data was not determined because high-throughput software was not accessible.
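For readers unfamiliar with the RSCU (relative synonymous codon usage) score, the following minimal Python sketch shows the standard definition: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. It is an illustration only and not the procedure used in this study. It assumes in-frame coding sequences and the standard genetic code, and the function name `relative_synonymous_codon_usage` is hypothetical. Under this definition the maximum possible value equals the number of synonymous codons, so a score above 5 can occur only for the six-fold degenerate amino acids (leucine, serine and arginine).

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built in TCAG order (assumption: no alternative code is needed).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group synonymous codons by the amino acid they encode (stop codons excluded).
SYNONYMOUS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        SYNONYMOUS[aa].append(codon)


def relative_synonymous_codon_usage(sequences):
    """Return {codon: RSCU} pooled over in-frame coding sequences.

    RSCU = observed codon count * number of synonymous codons
           / total count of all codons for that amino acid.
    Amino acids that never occur in the input are omitted.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for start in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[start:start + 3]
            if codon in CODON_TABLE:  # skips gaps and ambiguous bases
                counts[codon] += 1

    scores = {}
    for aa, codons in SYNONYMOUS.items():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue
        for c in codons:
            scores[c] = counts[c] * len(codons) / total
    return scores


if __name__ == "__main__":
    # Toy input (hypothetical): four leucine codons, three CTT and one CTC.
    scores = relative_synonymous_codon_usage(["CTTCTTCTTCTC"])
    print(scores["CTT"], scores["CTC"])
```

Pooling the counts over all sequences from one population, rather than scoring each sequence separately, is a design choice made here for brevity; per-sequence scores can be obtained by calling the function with a single-element list.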