Stacked MLP mixer for variational autoencoders

Octavio Gonzalez-Lugo
3 min read · Feb 23, 2022



Convolutional layers have long been the default building block for neural networks on image-related tasks. That was the norm for some time, until transformers offered similar performance without convolutions. One reason may be that transformer-based vision models are trained on patches of the image.

The MLP mixer is a neural network architecture that also processes images as patches. Each block contains two different MLPs: one applied within each patch and one applied across patches, mixing channel and spatial information respectively. The output is pooled and connected to a final dense layer for classification.
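To make that structure concrete, here is a minimal sketch of a single mixer block written in Keras. The framework choice and the sizes used (number of patches, channels, and hidden widths) are illustrative assumptions, not the exact settings used for the experiments in this post.

# Minimal sketch of one MLP mixer block; sizes are placeholder assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def mixer_block(inputs, tokens_mlp_dim, channels_mlp_dim):
    """Applies token mixing (across patches) then channel mixing (within each patch)."""
    # Token mixing: transpose so the MLP acts across the patch dimension.
    x = layers.LayerNormalization()(inputs)
    x = layers.Permute((2, 1))(x)                       # (batch, channels, patches)
    x = layers.Dense(tokens_mlp_dim, activation="gelu")(x)
    x = layers.Dense(inputs.shape[1])(x)                # back to the number of patches
    x = layers.Permute((2, 1))(x)                       # (batch, patches, channels)
    x = layers.Add()([inputs, x])                       # skip connection

    # Channel mixing: an MLP applied independently to each patch.
    y = layers.LayerNormalization()(x)
    y = layers.Dense(channels_mlp_dim, activation="gelu")(y)
    y = layers.Dense(inputs.shape[-1])(y)
    return layers.Add()([x, y])                         # skip connection

# Example: 64 patches with 32 channels each.
patches = layers.Input(shape=(64, 32))
mixed = mixer_block(patches, tokens_mlp_dim=128, channels_mlp_dim=128)
model = tf.keras.Model(patches, mixed)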

This architecture can be modified, however, by reshaping the final output into a square feature map and feeding it as input to another MLP mixer. This increases the depth of the network, and the downsampling operations can reduce the image more gradually.
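A hedged sketch of that stacking step, reusing the mixer_block helper from the sketch above: flatten the output of one mixer stage, project it down to a perfect square, reshape it, and run it through a second stage. The sizes (a 64x32 input reduced to a 16x16 square) are assumptions chosen so the shapes work out, not the post's exact configuration.

# Stacking sketch: reshape one mixer stage's output into a square and feed it
# to another mixer stage. Reuses mixer_block from the previous sketch.
inputs = layers.Input(shape=(64, 32))               # 64 patches, 32 channels
x = mixer_block(inputs, tokens_mlp_dim=128, channels_mlp_dim=128)

x = layers.Flatten()(x)                             # (batch, 64 * 32)
x = layers.Dense(16 * 16, activation="gelu")(x)     # downsample to a perfect square
x = layers.Reshape((16, 16))(x)                     # 16 "patches" of 16 channels each

x = mixer_block(x, tokens_mlp_dim=64, channels_mlp_dim=64)   # second mixer stage
stacked = tf.keras.Model(inputs, x)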

With this simple idea, we can compare the reconstruction accuracy of different stacked MLP mixer architectures. The two architectures differ in their downsampling operations and in their depth. Here, the main comparison is the overall reconstruction loss rather than specific combinations of hyperparameters.
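One way to set up such a comparison is to wrap the stacking step in a loop so the number of stages becomes a parameter, then build models of different depths and inspect their parameter counts. Again, this is only a sketch under the same assumed shapes, not the exact models trained for the figures below.

def stacked_mixer(input_shape=(64, 32), depth=3, square_side=16):
    """Builds a stack of `depth` mixer stages, reshaping to a square between stages."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for _ in range(depth):
        x = mixer_block(x, tokens_mlp_dim=128, channels_mlp_dim=128)
        x = layers.Flatten()(x)
        x = layers.Dense(square_side * square_side, activation="gelu")(x)
        x = layers.Reshape((square_side, square_side))(x)
    return tf.keras.Model(inputs, x)

# Compare a shallow and a deeper stack, e.g. as the encoder of a VAE.
shallow = stacked_mixer(depth=3)
deep = stacked_mixer(depth=6)
print(shallow.count_params(), deep.count_params())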

Five models of depth 6 vs. nine models of depth 3

From both images, we can see that there is a clear difference in the reconstruction loss. Increasing the network depth results in a tenfold decrease in reconstruction loss. However, this gain in performance does not come with a noticeable increase in the number of parameters; the best-performing models in both cases have a similar number of parameters.

Five models of depth 6 vs. nine models of depth 3

This particular flexibility of the MLP mixer architecture can be useful for reshaping images before a convolutional network, facilitating the use of oddly shaped images or applying the architecture to other tasks. As always, the complete code for this post can be found on my GitHub by clicking here. See you in the next one.
