SARS Cov 2 classification with variational autoencoders.

17 min readDec 15, 2021

Since the identification and later sequencing, the number of available SARS-Cov-2 sequences continues to grow. This continuous surveillance made possible the fast detection of different variants. Depending on the number and importance of changes on its genetic information such variants are classified into three groups.

A variant under monitoring(VUM), “variant with genetic changes that are suspected to affect virus characteristics with some indication that it may pose a future risk, but evidence of phenotypic or epidemiological impact is currently unclear”.

A variant of interest (VOI) is a “variant with genetic changes that are predicted or known to affect virus characteristics such as transmissibility, disease severity, immune escape, diagnostic or therapeutic escape. And identified to cause significant community transmission or multiple COVID-19 clusters, in multiple countries with increasing relative prevalence alongside an increasing number of cases over time, or other apparent epidemiological impacts to suggest an emerging risk to global public health”.

And a variant of concern (VOC) meets all the criteria to be defined as a VOI but with one or more of the following characteristics. “Increase in transmissibility or detrimental change in COVID-19 epidemiology. Increase in virulence or change in clinical disease presentation. Or decrease in the effectiveness of public health and social measures or available diagnostics, vaccines, therapeutics”

These variants’ definitions were taken from the WHO website and are adjusted periodically. Each variant is assigned to a lineage or established as a new one using different methods. Then an expert panel discusses the available information and presents it to the public.

Computational Methods for Variants Identification and Classification

There are two main tools used to find the particular lineage of a SARS-Cov-2 sequence. The first one is Nextstrain, an open-source project that aims to track pathogenic genome data. On its website, we can find the latest SARS-Cov-2 sequences analysis. The main component of this analysis is the SARS-Cov-2 phylogenetic tree. Making it the central tool used to understand the evolution of the virus. As well as to detect new variants.

To build the phylogenetic tree, Nextstrain uses a tool named TreeTime. It finds an approximate maximum-likelihood configuration of the phylogenetic tree, with large sequence alignments as input data. But, this process is computationally expensive, time-consuming and it will only be helpful if the sequence to analyze is novel.

To bypass the novelty unknown, the Pangolin tool provides a machine learning model to classify unknown SARS-Cov-2 sequences to an already known lineage. Aiming to filter the high volume of sequences due to genetic surveillance. This classifier uses a one-hot encoding of the SARS-Cov-2 sequence and the lineages as labels.

Sequence Representation

Inputs used by the different tools represent some of the many ways a sequence is represented for computational applications. A one-hot encoding of a sequence is perhaps the simplest of the representations. In the case of a biological sequence each of the bases is changed to a vector of size four and the kind of base is encoded by a 1 at their respective position. Then each base in the sequence is encoded with the same procedure, resulting in an array of shape (4, sequence length).

Another common representation scheme is the use of k-mers or k size fragments of the sequence as representation. These k-mers can be used to define a new one-hot encoding. Or the k-mer frequency can be used as another form to represent the sequence.

The frequency of each k-mer will depend on the way the sequence is divided. If the sequence is divided into non-overlapping fragments, there will be (sequence length // k-mer size) fragments. While a sliding window will result on (sequence length — kmer size ) fragments. If the sliding scheme is used an interesting property rises. Each fragment contains k-1 matching bases matching a De Bruijn graph. By using the De Bruijn graph connectivity scheme we can add the connectivity relationship between consecutive k-mers.

Another form to encode the connectivity relationship is by dividing the sequence into 2k-mers. Then splitting the fragments into k-mers and adding a link between both k-mers. Under this scheme, the order of the connections is lost. But the frequency of such connections is added to the representation. As well as being able to encode larger k-mer fragments.

Both connectivity schemes aim to encode the relational information between the different symbols or k-mers that exist on a sequence rather than the symbol or k-mer itself.

By using graphs to numerically define a sequence we can use many of the different matrix representations of graphs. The node degree of the undirected De Bruijn graph will be equal to two times the frequency of each k-mer. While the adjacency matrix will show the frequency of each connection in the sequence.

Low Dimensionality Representation

From the graph representation of a sequence, several datasets can be created. The first and simplest one might be the dataset containing the k-mer frequencies of the sequence. Stacking the different k-mers up to the 4-mer of each sequence results in four visible clusters in the plot.

Using this dimensionality reduction technique results in clear clusters in the data. But a consequence of drastically decreasing the dimensionality of the data is the amount of lost information. Another option to reduce the dimensionality of the data is a variational autoencoder (VAE). In a VAE, the low dimensional representation will be constrained to behave as a normal distribution. Also, the learned representation might contain biologically relevant information.

VAE encoding of the k-mer data results on a series of clusters aligned through the X-axis. Clear separation between clusters appears between a couple of clusters, while others are merged.

To leverage the graph representation of the sequences a new dataset can be constructed by calculating the difference between the adjacency matrices from both connectivity schemes. And rearranging the different matrices into a single one.

Creating a two-dimensional array that can be used as a single-channel image. This allows encoding the connectivity information as well as being able to encode larger k-mers. From this data set, a convolutional variational autoencoder can be used to find a better low-dimensional representation of the data.

The resulting low dimensional space of this new autoencoder results in a similar behavior from the previous one. Except that the clusters are well defined over the x-axis.

Latent Representation.

The main advantage of a VAE is the small latent representation that can be used for other tasks. Clusters obtained from the previous autoencoders could represent a single variant or encode another kind of information. Also, the latent space encodes some meaning inherent to the data. For example, in a faces data set the latent space could encode face expression (happy/sad) or face pose (up, down, left, right).

Both k-mer frequency and k-mer connectivity-based representations model a sequence as a whole. Preventing the reconstruction of the sequence. However, biases towards specific codons or some other encoded meaning stll can be found. In a previous post, PCA analysis of k-mer frequency data resulted in clusters that contained sequences from a single geographical origin. Hence it could be possible that the latent dimensions encode for some sort of geographical encoding. While another option could be some sort of time encoding.

Although those are not the only possible options, those are the ones that can be tested with the available sequence metadata. To visually encode the different features I’m presenting a color encoding scheme. In the case of geographical locations, equal colors will represent the same location. However, similar colors will not share any kind of geographical similarity. That will not be the case for time encoding, similar colors will represent closer periods. Time will be encoded as the number of the week, regardless of whether it’s the first or the second year of the pandemic. The time encoding scheme will be based on the isolation date of the sample.

Also, the different models will be renamed, the PCA model will be referred to as Katya. The simple variational autoencoder will be named Nina. And the convolutional autoencoder Masha.

Geographical encoding.

In terms of geographical encoding neither Katya, Nina, nor Masha can’t find any specific pattern. This could be in part due to the heavily biased nature of the dataset. Although it contains sequences from different parts of the world, about 85% of the data is from the USA population.

Time encoding

Time-wise, Katya can separate the pandemic into two specific periods. Although there is some mixing, this could be due to the cyclical time encoding. As the last part of the year is closer to the initial using this approach.

Meanwhile, Nina can also separate the data into specific periods, it also adds structure and constraints to the data. The x-axis from this latent representation appears to encode some form of time encoding.

Masha is also able to encode some kind of temporal or seasonal information into the representation. However, each cluster appears to contain both periods mixed inside the cluster. But the trained network is unable to separate these two time periods into single clusters. Clusters at the extreme of the x-axis appear to start to separate into low time and high time encoding. Further improvement of this model could lead to an alternating pattern of low and high time-encoding.

Cluster Analysis.

Performing K-means clustering on Katya confirms the initial observations of the time encoding. Clusters contain sequences sampled on the first half of the year for one cluster and the remaining second half of the year for the second. Also, both clusters start to merge in the middle part of the year.

While Nina, by adding structure to the data each cluster appears to move through time. Showing two transition steps between the most separated and highly populated clusters.

Although Masha is unable to separate into time dependant clusters, the histograms of each cluster confirm the time encoding visualization. Each cluster within Masha contains information on both periods.

Variants.

Up to this point, the models appear to encode time or seasonal information within the SARS-Cov-2 sequence. However, the remaining dimension could also encode information about the variants. Two approaches are used to test this idea. On the first one, each pangolin lineage represents a single color. While the second consists of a binary encoding where the A and B lineage branches are represented by different colors. As the data set consists mostly of sequences from A or B lineages.

Using that simple color encoding, all three models can encode the variants regardless of the coloring scheme. And this particular encoding is closely similar to the temporal encoding found before.

This would suggest that each branch in the pangolin classification represents a variant with better evolutionary performance to a particular time of the year. If that is the case there should be a difference in the proportion of the usage of specific nucleotides in the sequence. Codon usage bias index or GC content are some examples of features used to test for biological adaptation. But, those features need to be calculated from the open reading frame to have accurate results. But by constraining how the sequence is analyzed perhaps many of the encoded features in the sequence might be lost. Therefore a simple approach will be to check the frequency of each of the different bases within the sequence.

Calculating the histogram of nucleotide usage from one cluster within Nina and comparing it to the remaining ones results in slight shifts in frequency. SARS-Cov-2 sequences collected in the second half of the year contain less Cytosine compared to the remaining period. While the opposite is true for Thymine/Uracil.

Performing a similar analysis on Masha results in similar shifts on Cytosine and Thymine/Uracil. However, the shift in the Cytosine or Thymine/Uracil content appears to be greater compared to the shift observed in Nina. But, this leaves again the x-axis without a clear interpretation of what kind of information encodes.

As the data contains a high number of sequences from Covid-19 US cases, perhaps the x-axis is encoding some kind of geographical specific to that location. Plotting the geolocation of each cluster from the first and second half of the year results in an alternating pattern. Figures represent each cluster, colors represent elevation and the x-axis and y-axis represent longitude and latitude.

However, this alternating pattern does not match the clusters within Masha. But this simple observation hints that the x-axis within Masha encodes for some kind of geographical or environmental variable.

Data biases.

Even when the three different models can encode temporal and environmental information. The ability to generalize and extrapolate to general conclusions might be hindered by the biases presented in the data. Data from the first year of the pandemic represents only 10% of the total. Which makes the analysis heavily biased towards the second year of the pandemic.

Geographically, the data can be clustered into heavily populated areas inside the US. Also, I was unable to download the geolocation of some of the cities in the metadata. This lowers the ability to make predictions from Masha, the only model to show some resemblance to a kind of environmental encoding.

However, I acknowledge that the different models can help us to have a better understanding of the development of the Covid-19 pandemic. And they describe the pandemic in the US with accuracy.

From the three models, three key insights are found. A seasonal component, perhaps encoded as the shift from the A lineage to the B lineage. A shift in Cytosine use from high content in the first half of the year to low content. Also a shift in Thymine/Uracil in a mirrored way. And a hit to geographical or environmental encoding.

Seasonality and adaptation

Seasonal trends are one hallmark of many infectious diseases. Specific to the COVID-19 outbreak Caetano-Anolles and collaborators showed the existence of a cyclical pattern of mutations in the RBD domain of the spike protein. This cyclical pattern also reflects a cyclical pattern inside the SARS-Cov-2 sequence.

Another cyclical mutation pattern can be inferred by looking at the different genomic surveillance reports in Mexico. For the first half of the year, most of the sampled sequences were from the B main lineage. While for the second half and counting the B lineage started to be displaced by the A lineage.

Reasons behind the shift of one lineage to another and seasonal trends are topics of continuous research. One hypothesis could be the seasonal adaptation to the host by the virus. In lung tissue around 2000 different genes are upregulated by night and 1500 genes upregulated by day. This behavior is also repeated throughout the seasons. Lung tissue differentially expresses around 20% of genes at least in one season. Then the virus by adapting to the available resources to copy its genetic material ends up moving from one lineage to another. This leaves a period to adapt to the newly available resources. Once this adaptation period is completed and the virus can spread more efficiently.

If this adaptation process takes place it could be within the same host or in small increments at different hosts. And could explain the oscillatory behavior of the different pandemic waves. Making it a natural and cyclical process of the virus population cycle. A necessary step for this model to work is the presence of coinfections inside a single host. Cases, where coinfections were detected, remain scarce. However, reports of coinfections in different parts of the world start to become more common.

Reports of single cases of coinfections were reported by Cattoir regarding a Belgian patient. Mohammed Baqur S. Al-Shuhaib reported several cases of coinfections in Babylon Iraq. While Akimkin reported a patient where samples were taken at different times through the infection. Both samples were sequenced and the main lineage was different between samples.

Coinfection with different strains of the virus rises concert among scientists. As recombination events could take place leading to new and different strains of the virus. Wu and collaborators collected sequence reads to analyze for signs of coinfection. This resulted in samples with signs of coinfections with two or three different lineages. Yet the ratio of coinfection events drastically drops in the middle of the year.

Even when that finding could contradict some of the previous ideas. Reanalyzing sequence read for signs of coinfection could be helpful to understand if a yearly dynamic adaptation process takes place.

Environmental and seasonal treatments.

Nucleoside analogs are one of the most common pharmacological treatments for viral diseases. In the case of Covid-19 Remdesivir is perhaps the most widely known antiviral treatment. In a meta-analysis, Hsueh showed that Remdesivir is associated with a better clinical outcome. Yet, even with a numerical reduction in mortality, it was not statistically significant. If we analyze the overall content of adenine, the molecular analog of Remdesivir. We can see that there is a slight deviation in the histogram, pointing to a bimodal distribution.

But, if we look at the adenine distribution within the different clusters in Masha only one cluster contains sequences with high and low adenine content. This could indicate that there might be a subset of Covid-19 more susceptible to Remdesivir. And this susceptibility is specific to a particular environment.

A clear definition of nucleotide usage between the different clusters is perhaps the most useful piece of information. This shift can be exploited for treatment as described before. Another option for treatment can be found in the shift of Cytosine or Thymine/Uracil. Literature about the use of Cytosine analogs for covid treatment is scarce. However, a reduction in hospitalization risk among HIV patients might hint at the possibility to use such antivirals. Studies in Spain and Germany found a reduction in risk of hospitalization, as well as most of the cases, were classified as mild. In both studies, the participants were on antiviral treatment containing at least one Cytosine analog. While in the case of Thymine/Uracil, Sofosbuvir is the main antiviral being studied. A meta-analysis by Chih-Cheng Lai showed that Sofosbuvir increased the recovery rate and decreased mortality.

Seasonal and environmental patterns inside the SARS-Cov-2 sequence can be taken into account to further test the efficacy of different treatments. Clinical trials and continuous research on the topic are the only methods to fine-tune possible treatments.

Models reproducibility

Over this post, a series of models were analyzed to understand how the low dimensionality representation of the sequence data encodes biologically relevant information. Each model can be trained without the use of highly specialized hardware. And they have already shown their capability to understand and provide some insights from the sequences. Also, for the two variational models, only one latent dimension was more or less characterized. This leaves enough room to add more metadata to newly sequenced SARS-Cov-2 strains or to try to add more to the already existing ones. Improvement or development of new generative models could lead to patterns that lead to applicable actions. From predictions of the possible outcome to resource management.

Also, this post represents a complement to a series of Kaggle datasets and notebooks. The different data sets can be downloaded or used for analysis within the platform. While the notebooks present an exploratory analysis of the k-mer dataset as well as the different neural networks and training parameters used to define and train the different models. Scripts used to filter the sequences, create the datasets, and train the neural networks can be found on my GitHub by clicking here. The main difference between Kaggle and Github is the number of epochs used to train Masha. Huge thanks to Katie Green for signing as a referred member, hope not to disappoint. And if you want to support me to continue this taco-fueled machine learning endeavor please consider joining Medium as a referred member by clicking here. You can access all the content on the platform while a fraction of the fee will directly support me. Please stay safe and talk to you in the next one.