SARS CoV-2 genome classification

17 min readMar 15, 2021

Reported first in Wuhan capital of the Hubei province in China as a novel respiratory disease, the coronavirus disease of 2019 or COVID-19 pandemic has taken over the world since the end of 2019. With symptoms similar to the common cold, the detection of the COVID-19 is somewhat elusive and can be fatal if it’s not detected at an early stage or proper medical care is not available. The pathogen responsible for COVID-19 was detected and sequenced after a short time and named SARS Cov-2.

About a year has passed since the first SARS Cov-2 sequence was made publicly available, and new sequences from other outbreaks around the globe were made public. The GISAID and the NCBI are two of the most widely used services to upload and access the SARS Cov-2 sequences for scientific research. The following describes an exploratory analysis of the SARS Cov-2 collection of complete genomic sequences.

Basic Characteristics

The sequences were downloaded from the NCBI COVID-19 resources on February 19 in a FASTA format then sequences were loaded using biopython and only complete genomes were selected using regular expressions. The description over the FASTA sequence was used to scan for complete genomes and ‘(.*)’ to ignore other characters.

From that selection, the majority of the sequences have a length of around 30 000 bp (base pairs). Removing the outliers defined as those sequences with less than 10 000 bp shows that the majority of the sequences have a length in the range between 29 500 and 30 000 bp.

Those selected genomes contain the DNA sequence of the SARS Cov-2, and that sequence must contain only four different letters A for adenine, C for Cytosine, G for Guanine, and T for Thymine. Nevertheless, sometimes DNA FASTA sequences contain more characters and those characters mean the possibility of a subset of the DNA available characters. For example, the letter N on a FASTA sequence means any of the four characters can be found in that position. To check for sequences that contain the standard set of characters first the sequence is fragmented into a one-element list and the unique elements of that list are obtained. If the length of the unique elements set is equal to four, that sequence is said that uses a canonical alphabet.

To accelerate the process, the iteration was parallelized with the multiprocessing package. An index is returned that contains the location of the canonical and non-canonical sequences. That analysis results in 30211 canonical sequences and 13460 non-canonical sequences. As the non-canonical sequences with only two different characters can generate a set of sequences much greater than the canonical sequences the non-canonical set of sequences will not be analyzed. Thus the canonical set of sequences will be taken as the ground truth.

From those canonical sequences, the majority of them come from the USA or Australia, with about 80% of the total sequences. From the remaining locations, Netherlands and France contribute the most with new SARS Cov-2 genomes. It’s important to point out that many sequences in the location found in the description is the one of the outbreak rather than the country, or a different abbreviation for the country. All those sequences were classified as Non-Standard.

While the complete genome size reported from each country has a mean of around 29800 bp, being the USA, Australia, and the Non-Standard locations the ones with the higher variability in size, skewed towards a smaller complete genome.

K-mer

Like the n-gram model of early computer linguistics, the k-mer refers to a sub-sequence of k elements within a biological sequence. The division scheme used to obtain the k-mers can affect the number and kind of k-mer obtained in a sequence. The non-overlapping division scheme divides the sequence into k-size fragments and the total number of k-mers obtained is equal to the length of the sequence divided by the length of the k-mer. While the sliding scheme takes a k-size fragment from the sequence and slides through the sequence one character at each step, the total number of k-mers obtained is equal to the length of the sequence minus the k-mer length. To obtain the sequence k-mers the sliding scheme will be applied with the following.

Know that the k-mers can be obtained for each sequence the next step is to determine the appropriate length of the k-mer to be used. By plotting the number of unique k-mers of a given size that can be generated by the data set it can be observed that the 6-mer is around 80% away from the theoretical maximum number of k-mer. And from that point forward the number of uniquely generated k-mer increases rapidly reaching the theoretical maximum. To prevent the generation of a data set with a greater number of k-mers than samples the max k-mer to be analyzed will be up to the 5-mer.

To analyze the unique k-mers, the frequency of each of the k-mers is measured from the data set. First, a function is defined to count the frequency of each k-mer in the sequence. Then that function is paralleled for a fast k-mer frequency calculation.

Know with the k-mer dataset created is time to look for patterns in the data, a quick way to look for patterns is just to use a simple scatter plot. However, as the resulting data set is multidimensional thus the PCA projection can be used to look for patterns. The following shows the PCA projection of the 1-mer then the 1-mer plus the 2-mer and so on for the available data. It can be noticed that as k increases several data points start to form different clusters.

To analyze the clusters, the labels of each cluster are obtained using the DBSCAN algorithm and the elbow method to determine the min distance between neighbors.

1-mer+2-mer+3-mer left, 1-mer+2-mer+3-mer+4-mer right

Each different color represents well-defined clusters, while black represents outlier samples. From those clusters, two are selected to be analyzed for differences between them. The particular reason why these two clusters will be explained later.

The clusters

The first cluster or cluster one contains 69 sequences while cluster two contains 7978 sequences. The mean sequence size of cluster 1 is around 29520 with few outliers. While the sequence size distribution of cluster two appears to be bi-modal, with two peaks, around 29810 and 29840 bp.

To make a pairwise comparison of the similarity between the samples of each cluster the euclidean distance, the correlation, the cosine similarity, and the city block distance are used as similarity measures. From those four measurements, only the correlation shows a distribution with only one peak, while the three remaining measurements show a bi-modal distribution, however, the second peak is several orders of magnitude smaller than the main one.

With the same approach, both clusters are compared with the Wuhan outbreak reference genome to assess the dissimilarity between the clusters and the reference genome. Cluster one is as dissimilar to the reference genome as is dissimilar to cluster two, as most of the similarity measures lay on a similar range as the ones from the previous analysis.

While cluster two is more similar to the reference genome, mean euclidean distance, city block distance, and cosine similarity are around half the values found when compared to cluster one. However, correlation values drop to a mean around 0.4, which could mean that although the samples are close, there is some variation between the different k-mers.

K-mer and reading frames

The k-mer model can extract information from sequences with biological and functional meaning. To understand better the intersection between the k-mer model and biological sequences let’s explain the basics of DNA and gene or protein expression.

A DNA sequence contains the information to make the different proteins that make our cells, tissues, and body. DNA is transcribed to messenger RNA or mRNA, then is read in the ribosome by fragments of three mRNA bases or triplets. Particularly the non-overlapping 3-mer data have a well-defined biological meaning, however, if the 4-mer is taken two overlapping sets of triplets can be obtained. Following that idea, the 5-mer contains three sets of overlapping triplets. Each triplet from the 4-mer or the 5-mer contains a range of possibilities on how a sequence can be read, those possibilities are also known as reading frames. Thus, when analyzing up to the 5-mer or the sliding 3-mer, essentially it has encoded information for up to three different reading frames.

Aside from methionine and tryptophan that are encoded by one triplet, the remaining amino acids are encoded by more than one triplet or codon. Different codons are used in different frequencies, that phenomenon is called codon usage bias or CUB. CUB is a natural phenomenon that has been studied by several scientists in different model organisms. Changes in CUB from a pathogen such as a virus will reflect an adaptation of that pathogen in some form.

To analyze the CUB it’s important to have the open reading frames of the SARS Cov-2 virus, an open reading frame (ORF) is defined as a part of a sequence bounded by a start codon and a stop codon. As simple as it may sound, finding ORFs is a difficult task especially on a virus where space limitations on its genome make the virus use overlapping reading frames. As reported by Wei et al [4] SARS Cov-2 contains several overlapping ORFs. It can be expected that with more time more overlapping ORFs can be found. Hence to alleviate the existence of overlapping ORFS the sliding 3-mer will be used to evaluate in an exploratory manner for a possible change in CUB. As that analysis can’t be said to be equivalent to an analysis of CUB it will be defined as pseudo-CUB.

The distribution of the different triplets found on cluster one and cluster two show no extreme deviations from the mean between the different triplets, with only around a 10% deviation from the mean.

Comparing cluster one with cluster two for the pseudo-CUB results in a greater usage of ‘T’ containing codons for cluster one, while cluster two appears to favor ‘C’ or ‘G containing codons. A ‘GC’ CUB is often associated with greater stability on the resulting DNA strand due to the ability to form three hydrogen bonds, however, the ‘GC’ content is not the only factor that contributes to DNA stability.

Blue favors cluster one, red favors cluster two

Comparing cluster one and cluster two with the reference sequence shows that on cluster two few bars cross over 0.2 while the comparison with cluster one shows a greater difference. Cluster one appears to favor codons with ‘T’ compared to the reference sequence. While Cluster two appears to have a mild increase in most of the codons, about one-third of the available codons show a small increase in frequency.

Top cluster one comparison, bottom cluster two comparison. Blue favours respective cluster, red favors reference genome

Regardless of its origin CUB regulates translational efficiency (going from mRNA to protein), as many of the resources for translation are shared inside the cell. Thus the sequence of highly expressed genes of a cell will determine that cell CUB. In the case of a viral infection two phenomena are in place, the CUB from the host organism and its target tissue, in the case of SARS Cov-2 the respiratory system, and the CUB of the virus. There are several reports on how CUB from SARS Cov-2 changed thru the pandemic. For example, Poon et al [2] reported that the amino-acid usage pattern of SARS Cov-2 was found to be similar to bat and human SARSr CoVs. While Khodary et al [1] reported some level of codon adaptation towards lung tissue, specifically the ORF7a or nucleocapsid and the surface glycoprotein from SARS Cov-2.

Changes in the SARS Cov-2 genome that affect CUB are not necessarily mutations that create a new functional viral protein. Mutations that affect the CUB theoretically will enhance translation efficiency, which means that fewer virions (viral particles) are needed to infect and cause disease. Does that mean that eventually, SARS Cov-2 will have a CUB equal to the one found in the respiratory system? Perhaps, but that will be detrimental to the virus. Most viruses already have a strong promoter. a piece of sequence that promotes the translation of the virus, thus a strong activation by the promoter and high efficiency will exhaust the translational machinery of the cell. From the point of view of the virus, a non-optimal CUB is needed.

Clusters Origin

Know with better knowledge about the SARS Cov-2 genome and how some changes on the viral genome can affect the spread and adaptation of the virus let’s find out the origin of the clusters. Four different clusters and a set of outliers were found using the DBSCAN algorithm, those will be named A, B, C, and D.

Cluster A contains sequences mainly from the USA followed by non-standard locations, the Netherlands, and other countries.

Cluster B contains sequences from a non-standard location, a quick examination of the description attribute from the FASTA sequence shows that all the sequences are from PER or Peru.

Cluster C contains a mixture of non-standard locations nonetheless the majority of the sequences are from the USA.

Cluster D contains only sequences from Australia, also most of the Australian sequences can be found in this cluster.

And around 20 different sequences did not meet any criteria to be classified as one of the many clusters. And most of those sequences can be found close to a particular cluster.

From those four clusters, cluster one is the same as cluster B and contains only sequences obtained from Peru, and cluster two is the same as cluster D and contains only sequences from Australian isolates. Both clusters were selected for the difference in how the pandemic has affected both nations. Peru faced a health care system saturation on the first COVID-19 wave and the total number of cases is around 1.3 million cases. While Australia opted for a strict lock-down and closed its borders to foreigners on the 20th of march. The total number of cases is around 29000, which means that one of every four COVID-19 cases was sequenced for further analysis.

Although it can be noticed a socioeconomic difference between nations, both applied similar measures to alleviate economical and healthcare pressure. Both release emergency funds for businesses as well as workers. Both enforced social distancing limited the number of participants in social gatherings and used in different extents curfews. Perhaps due to some cracks in the system, miss calculations or other sociological problems both nations ended on extreme sides of the COVID-19 pandemic.

Leaving out how the pandemic was managed by both nations and just looking at the total number of cases the difference is several orders of magnitude. With a high number of infected individuals, the probability of SARS Cov-2 mutates increases. Hence mutations that enhance host adaptation such as a change in CUB will be easier to appear in a population with higher levels of exposure to the pathogen.

Classificant and K-mers

To have a model that can to some extent classify from sequence to geographical origin can be a helpful model for many applications. Nonetheless, the k-mer model is not an exact model, from the four clusters two had a common origin while two were a mixture of origins. Hence, some scenarios will be discussed.

Geographical origin.

It could be said that Cluster C perhaps contains sequences from Mexico, a neighbor of the USA, and cluster A is a set of miss classified sequences. However, cluster C indeed contains sequences from Mexico but also from Chile and other locations. Thus the common geographical origin classification might not be true.

Common outbreak.

Cluster A contains the majority of the sequences also contains the reference sequence. A possibility might be that the different clusters that surround cluster A are just outliers from cluster A that propagated through different countries.

Common Adaptation.

One of the many proposed explanations for CUB is the selection theory, meaning that codon bias contributes to the efficiency of protein expression, that contribution leads to a better performance of the individual and positive selection. In the case of SARS Cov-2, small changes in the virus sequence might be driven towards replicate the host CUB and being able to replicate more efficiently in the host without any modification at the protein level.

Mix of everything.

With the high rate of mutation on RNA viruses, it could be possible that someone infected with SARS Cov-2 travel to another part of the world and generates new SARS Cov-2 mutants with a better adaptation to that particular part of the world. Thus that variant becomes dominant in that geographical location.

Surveillance, Outbreak characterization, and K-mers

Even though the exact interpretation of the resulting clusters is a matter of discussion, one key aspect from the pseudo-CUB analysis is that a set of triplets can be found in a greater frequency in one cluster compared to the other one. An extension of that is that a set of k-mers appear on a different frequency from one cluster compared to the other one. Thus when the k-mer data is projected by PCA the projected space represents the data as separated as possible. With that characteristic, the PCA projection of k-mer data can be used for epidemiological surveillance.

If the k-mers of the SARS Cov-2 virus that affects a specific population are already measured and characterized, then it could be possible to detect new variants from local outbreaks, or the introduction of new variants from other parts of the globe.

For example, if only the samples from cluster two are projected by PCA two clusters are obtained. From the 1-mer up to the 4-mer two well-defined clusters can be observed, while the 5–0mer exhibit around six defined clusters. It will be interesting to know if both clusters represent both SARS Cov-2 waves that Australia has faced off. Know if some samples from random locations are added to the cluster two data from two to twenty new different samples, it’s impossible to differentiate from cluster two data.

But, from 200 to 2k samples well-defined structures appear, with 200 new samples the 4-mer and the 5-mer can differentiate the original samples from the outliers. And with 2000 samples the 3-mer, 4-mer and 5-mer can differentiate the original samples from the outliers

With a couple of hundred new sequences, the direction of the pandemic on a given location can be tracked down. Both the emergence of new strains or the consolidation of one can be examined with the k-mer data.

Pros and Cons

A neglected fact of this data set is how unbalanced it is regarding the nations that report complete SARS Cov-2 genomes (at least for this database). As most of the sequences are from the USA and Australia the bias on the data could make it easier to differentiate Australia and the USA. However, two well-differentiated USA clusters were found and the distance between clusters is large enough to make them recognizable. Also, a biological meaning can be found, as the k-mer data was able to differentiate between two very different pandemic scenarios and one with limited samples available.

Clusters were obtained from the PCA projection of the k-mer data leading to a loss of information from the k-mer data. From around 80% of the explained variance with the 1-mer to around 27% with the 5-mer data. Even though an insight can be obtained from the clustering of that data, most of the information is lost due to the transformation. A better sequence representation needs to be put in place to obtain a better and general conclusion.

Historically correspondence analysis (CA) has been used to analyze CUB, even that here a pseudo CUB is analyzed, CA was not used. However, PCA and CA are dimensionality reduction techniques that return the projection of the data by an axis. That axis is given by the eigenvalues of a dispersion matrix. PCA uses the covariation matrix while CA uses the weighted variance or inertia. As PCA is one of the most widely used algorithms in data science, PCA was used for its popularity and simplicity.

Most of the context surrounding each k-mer is lost when the frequency of each k-mer is measured. That context could be part of a regulatory region in the sequence or an overlapping ORF just to mention a few. Nonetheless, the k-mer analysis can be useful to evaluate multiple reading frames and is easier to solve and scale compared with multiple sequence alignment.

Know you have an example of how to use some basic tools from data science to visualize and analyze SARS Cov-2 sequences and obtain some useful insights. Also how to use those tools to propose a new data science application, The complete code for this post can be found on my GitHub by clicking here, and the data set can be obtained from the NCBI COVID-19 by clicking here. Just remember that the number of SARS Cov-2 sequences grows every day and perhaps some of the obtained results may vary. Please stay safe and talk to you soon.