A quick recap
In a series of posts, I used sampling schemes to preprocess large biological sequences, in particular SARS-CoV-2 sequences, chosen mainly because they are readily available from the NCBI SARS-CoV-2 resources site.
I used two representation schemes: a frequency-based scheme and a graph-based scheme. In the frequency-based scheme, each sequence is divided into overlapping fragments, and the frequency of those fragments is used to find a low-dimensional representation of the sequence. Because the overlapping fragments resemble the construction of a de Bruijn graph, I extended the idea by using different graph construction schemes.
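As a rough illustration of the two encodings, here is a minimal sketch. The function names, the default `k=4`, and the choice to skip fragments containing ambiguous bases are my assumptions for this example, not the exact code used in the earlier posts.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4, alphabet="ACGT"):
    """Return the relative frequency of every overlapping k-mer in `sequence`."""
    sequence = sequence.upper()
    # Slide a window of size k one base at a time (overlapping fragments).
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Fix the feature order so every sequence maps to the same vector layout.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Fragments with ambiguous bases (e.g. N) are ignored in the total.
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


def de_bruijn_edges(sequence, k=4):
    """Count the edges that link the (k-1)-mer prefix and suffix of each k-mer."""
    sequence = sequence.upper()
    edges = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        edges[(kmer[:-1], kmer[1:])] += 1
    return edges
```

With `k=1` the first function returns the plain nucleotide content, while larger values of `k` give the longer fragment frequencies. The edge counts from the second function are one possible starting point for a graph-based encoding.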
Both schemes create a small representation of the sequence, but at the current stage it is not possible to recreate the original sequence from it. However, it is possible to get a general overview of the sequence with scarce computational resources.
Applying PCA or a variational autoencoder (VAE) to those representations results in a series of clusters with a strong temporal component.
(From this point on, and in the following posts, sequence encoding will refer to either the frequency-based or the graph-based sequence representation. Learned representation will refer to the bottleneck of the VAE or other network. Composition will refer to the frequency-based representation of the sequence. I make this distinction because the single-element frequencies match the content of the different nucleotides in the sequence, so those values have a well-defined physical meaning, while the meaning of the remaining values is not clear.)
Thus the SARS-CoV-2 sequences appear to contain some sort of seasonal clock. This seasonal clock could be a side effect of sampling bias, since the number of isolates sequenced is about 10 to 20 times higher in the second year of the pandemic. However, removing that sampling bias by subsampling the sequences showed similar results: representations with a strong temporal component.
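One simple way to reduce that kind of sampling bias is to cap the number of sequences kept per time period. The sketch below assumes records arrive as `(period, sequence_id)` pairs and uses a hypothetical `per_period` cap; the subsampling actually used in the earlier posts may differ.

```python
import random
from collections import defaultdict


def subsample_by_period(records, per_period, seed=0):
    """Keep at most `per_period` sequence ids for each period label.

    `records` is an iterable of (period, sequence_id) pairs,
    for example ("2021-03", "some_accession").
    """
    rng = random.Random(seed)  # fixed seed so the subsample is repeatable
    groups = defaultdict(list)
    for period, seq_id in records:
        groups[period].append(seq_id)
    sample = {}
    for period, ids in sorted(groups.items()):
        # Periods with few isolates are kept whole; large ones are cut down.
        if len(ids) <= per_period:
            sample[period] = list(ids)
        else:
            sample[period] = rng.sample(ids, per_period)
    return sample
```

After this step every period contributes a comparable number of sequences, so a cluster can no longer be explained only by the larger number of isolates in the second year.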
A VAE consists of an encoder network and a decoder network. The encoder yields the learned representation, while the decoder returns an approximation of the original data point. The decoder also works as a generative model and offers a way to approximate changes in the input. Thus the changes or properties that yield the temporal component can be traced back by analyzing selected points in the learned representation rather than the whole dataset. Specific patterns can be obtained by analyzing the characteristics of a VAE latent walk.
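To make the idea of a latent walk concrete, here is a minimal sketch that moves in a straight line between two points of the learned representation and decodes each step. The straight-line path and the `decoder` argument (any callable that maps a latent vector back to an approximate sequence encoding) are assumptions for illustration; a trained VAE decoder would take its place.

```python
def latent_walk(start, end, steps=5):
    """Return `steps` evenly spaced points on the line from `start` to `end`."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    walk = []
    for i in range(steps):
        t = i / (steps - 1)  # goes from 0.0 at `start` to 1.0 at `end`
        walk.append([(1 - t) * a + t * b for a, b in zip(start, end)])
    return walk


def decode_walk(decoder, start, end, steps=5):
    """Decode every point of the walk with a caller-supplied decoder."""
    return [decoder(z) for z in latent_walk(start, end, steps)]
```

Comparing the decoded outputs along the walk shows which parts of the sequence encoding change as the latent point moves from an early cluster toward a late one.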
The clock inside the sequences is encoded by changes in the frequency of different 4-base fragments in the SARS-CoV-2 genome. The temporal information is encoded mainly in the structural components of the genome. This does not mean that the other parts of the viral genome cannot change. Rather, those “constant” regions might follow another kind of pattern, or the sequence encoding might be unable to provide enough information to characterize them.
Plotting the frequency of those 4-base combinations through time results in a wave-like pattern.
However, when I use day duration (day length) instead of the isolation date as the measure of time, this wave-like behavior disappears.
The use of day duration as a measure of time was the result of several attempts to merge environmental information with the learned representations. Previous attempts showed that environmental variables with a wave-like pattern agreed with the learned representations.
Using day duration as the temporal scale, rather than the Julian day, started to show some particularly useful characteristics. Most of the cases were confined to the extremes, at the minimum and maximum day duration at a particular location.
It also showed that the rate of change in day duration between consecutive days offers a way to approximate the start and the end of a COVID-19 wave at a particular location. This can be used to establish the relative transmission risk of COVID-19, linking an environmental change to viral transmissibility, much as abrupt changes in temperature are linked to the flu and some other winter illnesses.
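Day duration and its day-to-day change can be approximated from the latitude and the day of the year alone. The sketch below uses a common textbook approximation of solar declination and the sunrise equation; it is an assumption for illustration and not necessarily the method used in the analysis.

```python
import math


def day_length_hours(latitude_deg, day_of_year):
    """Approximate hours of daylight for a latitude and a day of the year (1-365)."""
    # Approximate solar declination in radians (about -23.44 degrees near the December solstice).
    declination = -math.radians(23.44) * math.cos(2 * math.pi * (day_of_year + 10) / 365)
    x = -math.tan(math.radians(latitude_deg)) * math.tan(declination)
    # Clamp so that polar day gives 24 h and polar night gives 0 h.
    x = max(-1.0, min(1.0, x))
    return 24 / math.pi * math.acos(x)


def day_length_rate(latitude_deg, day_of_year):
    """Change in day length, in hours, between a day and the previous day."""
    previous = (day_of_year - 2) % 365 + 1  # day 1 wraps around to day 365
    return day_length_hours(latitude_deg, day_of_year) - day_length_hours(latitude_deg, previous)
```

Under this approximation the rate of change is close to zero around the solstices, where day duration reaches its minimum and maximum, and largest in magnitude around the equinoxes.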
Why does the SARS-CoV-2 virus follow such a scale? That is a question to which I have no concrete answer. Nevertheless, the SARS-CoV-2 genome is similar in composition to a series of genes expressed through the action of the vitamin D receptor (VDR), and vitamin D is produced on exposure to solar radiation. Yet it is also similar to a series of other genes with apparently little involvement with solar radiation. Nonetheless, temperature is correlated with the learned representation and also with solar radiation. Day duration appears to work as a control variable by keeping sequence composition constant, and day duration is correlated with solar radiation. And some genes similar to SARS-CoV-2 are regulated by solar radiation. Thus I think it is safe to assume that solar radiation has a role in COVID-19 temporal adaptation. It might not be the complete picture, but it is an important part of it.
A complete index with the different sequence analyses and code can be found here. The epidemic curve analysis can be found here, and the preprint can be found here. If you reached this point and want to help me continue developing these open-source models, please consider joining one of the support platforms listed in the following link. Avoid abrupt changes in solar radiation and see you in the next one.