How to use clustering performance to improve the architecture of a variational autoencoder
An autoencoder is one of the many different special neural network designs, the main objective of an autoencoder is to learn how to return the same data used to train it. The basic structure of an autoencoder can be split into two different networks, an encoder, and a decoder. The encoder compresses the data into a low dimensions space, while the decoder reconstructs the training data. In the middle of those two networks lays a bottleneck representation of the data. The bottleneck representation or latent space representation can be helpful for data compression, non-linear dimensionality reduction, or feature extraction. In a traditional autoencoder, the latent space could take any form as there is no constrain controlling the distribution of the latent variables in the latent space. A variational autoencoder rather than learn a single attribute in the latent space, learn a probability distribution for each latent attribute. The following post shows a simple method to optimize the architecture of a variational autoencoder using different performance measurements.
A classic data set used in machine learning is the MNIST dataset, that dataset is composed of a variety of images of handwritten numbers from zero to nine. With the rise in popularity of the MNIST dataset similar datasets have been created. One of those is the Fashion MNIST dataset, it consists of images of several clothing items, divided into ten different categories. Each image is a simple 28X28 grayscale image.
Before the definition of the variational autoencoder, we need to define the different custom layers needed to train the variational autoencoder. In Keras, a custom layer can be created by defining a new class that inherits the characteristics of a Layer class from Keras. For this particular autoencoder, two customs layers are going to be needed a sampling layer and a wrapper layer. The Sampling layer will take as input the layer before the bottleneck representation and will be used to constrain the values that the latent space can take. By sampling a normal distribution the variational autoencoder will learn a latent representation that is normally distributed. That characteristic can be useful to create new data that does not exist in the training data, that can be done with the decoder by using a sample from the same distribution as an input. For that characteristic variational autoencoders are also classified as generative models.
With the sampling layer defined the wrapper layer is defined similarly. However, the objective of this layer is to add a custom term to the model loss. A variational autoencoder loss is composed of two main terms. The first one the reconstruction loss, which calculates the similarity between the input and the output. And the distribution loss, that term constrains the latent learned distribution to be similar to a Gaussian distribution. The second loss term is added to a layer before the bottleneck representation and it adds the Kullback Leiber divergence as a dissimilarity between the learned distribution and the Gaussian distribution.
Creating the autoencoder
As all the custom layers are already defined the autoencoder can be created. In an autoencoder, the number of densely connected layers or convolutional blocks gradually down-sample the shape of the data into the latent space size (encoder) and then gradually return it to the original data size (decoder). If the same units are for the encoder and the decoder, then both networks are the mirror images of each other. We can use that characteristic to use the same function to create the encoder and the decoder, just by simply reversing the order of the elements of each layer. Then, adding the custom layers to the encoder.
Variational Autoencoder performance can be measured in a variety of forms, the simplest one could be to use the model loss as a measure of autoencoder performance. When a neural network is trained the stochastic gradient descent algorithm is used to minimize the loss function and to calculate the layer weights and biases.
However, using the model loss minimization only shows us how the model learns the data but it doesn’t tell us how useful the latent representation could be. As the variational autoencoder can be used for dimensionality reduction, and the number of different item classes is known another performance measurement can be the cluster quality generated by the latent space obtained by the trained network. We can apply k means clustering to the latent space and calculate the silhouette coefficient of the clusters and use it as a performance measurement of the network.
With the different performance measurements defined we can start to optimize the architecture of the variational autoencoder. First, a random set of possible architectures is created, each architecture follows the unique constrain that the first layer has the same size as the input, which will ensure that the decoder will have the same output shape as the input data. Then each architecture will be modified following one simple rule if the number of layers in the current architecture is greater than 4 one random layer will be removed, if that is not the case each layer will be added k units, where k is the location of the layer. For example, the first layer will have zero units added as the location of the first layer is zero.
On each round of performance optimization, each individual in the architecture population will be updated if the performance is better than the previous performance.
Under the loss based performance measurement, all the trained models obtained a similar loss value, however, some variational autoencoders with fewer layers show a similar performance as those with a higher number of layers.
While the cluster-based performance shows a little more variation between the results. Although variational autoencoders with a high number of layers and a low number of layers show high performance, the best performing variational autoencoder happens to be in the middle regarding the number of layers.
Know you have an example on how to define and use custom layers in Keras, how to add custom losses to a neural network, and how to merge everything using the functional API from Keras. Also how to develop a basic evolutionary algorithm using different performance measurements or different optimization objectives. The complete code for this post can be found in my GitHub by clicking here and the complete dataset y clicking here. See you in the next one.