Deep Understanding of Variational Autoencoders

2026-04-06

Introduction: Autoencoders are an intuitive unsupervised neural-network method consisting of an encoder and a decoder, and they have become increasingly popular among researchers in recent years. This article, based on a blog post by machine learning engineer Jeremy, introduces the theoretical foundations and working principles of variational autoencoders, using face examples to help readers build intuition. It emphasizes the theoretical derivation and implementation details of variational autoencoders, and closes by showing the output of a variational autoencoder used as a generative model. Readers who wish to gain a deeper understanding of variational autoencoders should definitely read it.

Variational autoencoders

An autoencoder is a model that discovers some latent representation (incomplete, sparse, denoised, or contracted) of the data. More specifically, the input data is transformed into an encoded vector, where each dimension represents a property learned from the data. Crucially, the encoder outputs a single value for each encoded dimension, and the decoder then receives these values and attempts to recreate the original input.

Variational autoencoders (VAEs) provide a probabilistic way to describe latent space observations. Therefore, instead of building an encoder that outputs a single value to describe each latent state attribute, we can use the encoder to describe the probability distribution of each latent attribute.

Intuition

For example, suppose we have trained an autoencoder model on a large face dataset, with a 6-dimensional encoding. Ideally, we want the autoencoder to learn descriptive attributes of the face, such as skin color or whether the person is wearing glasses, so that each attribute can be represented by a feature value.

In the example above, we used a single value to describe the latent attributes of the input image. However, we would prefer to represent each latent attribute with a distribution. For example, given a picture of the Mona Lisa, it is difficult to confidently assign a specific value to the smile attribute, but with a variational autoencoder we can say with more confidence what distribution the smile attribute follows.

In this way, we now represent each latent attribute of a given input as a probability distribution. When decoding from the latent states, we randomly sample from each latent state distribution to generate a vector as input to the decoder.

Note: For variational autoencoders, the encoder is sometimes referred to as the recognition model, while the decoder is sometimes referred to as the generative model.

By constructing our encoder to output a range of possible values (a statistical distribution) and then randomly sampling from that distribution as input to the decoder, we are able to learn a continuous, smooth latent space. Values that are adjacent in the latent space should therefore correspond to very similar reconstructions, and for any sample drawn from the latent distribution, we expect the decoder to understand and accurately reconstruct it.

▌Statistical Motivation

Suppose there exists some hidden variable z that generates an observation x.

We can only see x, but we would like to infer the characteristics of z. In other words, we want to compute p(z|x). By Bayes' rule,

p(z|x) = p(x|z) p(z) / p(x)

Unfortunately, computing the evidence p(x) is quite difficult, since it requires marginalizing over the latent variable:

p(x) = ∫ p(x|z) p(z) dz

This integral is usually intractable. However, we can use variational inference to estimate p(z|x).

We approximate p(z|x) using a relatively simple distribution q(z|x). If we can choose the parameters of q(z|x) such that it is very similar to p(z|x), we can use it to perform approximate inference of the intractable posterior.

KL divergence is a measure of the difference between two probability distributions. Therefore, if we want to ensure that q(z|x) is similar to p(z|x), we can minimize the KL divergence between the two distributions:

min KL( q(z|x) || p(z|x) )

Dr. Ali Ghodsi's lecture (linked below) walks through a complete derivation, showing that minimizing the KL divergence above is equivalent to maximizing:

E_{q(z|x)}[ log p(x|z) ] − KL( q(z|x) || p(z) )

The first term represents the likelihood of the reconstruction, and the second term ensures that the learned distribution q(z|x) stays close to the true prior distribution p(z).

https://www.youtube.com/watch?v=uaaqyVS9-rM&feature=youtu.be&t=19m42s

Returning to our graphical model, we can use q(z|x) to infer the possible latent variables (i.e., the latent state) that could have generated an observation. We can construct this model as a neural network architecture, where the encoder learns a mapping from x to z and the decoder learns a mapping from z to x.

The loss function of this network will consist of two terms: one penalizes the reconstruction error (which can be considered as maximizing the reconstruction probability, as mentioned earlier), and the second encourages us to learn a distribution similar to the true distribution. For each dimension of the latent space, we assume that the prior distribution follows a unit Gaussian distribution.
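As a concrete sketch of this two-term loss (NumPy, with a squared-error reconstruction term standing in for the likelihood; the argument names are illustrative):

```python
import numpy as np

def kl_to_unit_gaussian(mu, log_var):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian,
    # summed over latent dimensions: -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

def vae_loss(x, x_recon, mu, log_var):
    # Reconstruction term (squared error here; a cross-entropy term is
    # also common for pixel data) plus the KL regularizer, averaged
    # over the batch.
    recon = np.sum((x - x_recon) ** 2, axis=-1)
    return np.mean(recon + kl_to_unit_gaussian(mu, log_var))
```

Note that the KL term is zero exactly when q(z|x) already is the unit Gaussian (mu = 0, log_var = 0) and grows as the learned distribution drifts away from the prior.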

▌Implementation

The preceding sections established the statistical motivation for the variational autoencoder. In this section, I will walk through the implementation details of how I built this model.

Unlike a standard autoencoder, which outputs the hidden-state values directly, the VAE encoder outputs a distribution for each latent dimension. Since we assume the prior p(z) follows a normal distribution, the encoder outputs two vectors describing the mean and variance of the latent-state distribution. If we wanted to build a true multivariate Gaussian model, we would need to define a covariance matrix describing how the dimensions are correlated. However, we make the simplifying assumption that the covariance matrix is non-zero only on the diagonal, which lets us describe this information with a simple vector.
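A minimal sketch of such an encoder head, with hypothetical dimensions (a flattened 784-pixel input and the 6-dimensional latent space from the face example) and a single linear layer standing in for the full network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
input_dim, latent_dim = 784, 6

# Two linear heads standing in for the final layer of the encoder network.
W_mu = rng.normal(scale=0.01, size=(input_dim, latent_dim))
W_log_var = rng.normal(scale=0.01, size=(input_dim, latent_dim))

def encode(x):
    """Return the two vectors describing q(z|x) for a diagonal Gaussian:
    one mean and one log-variance per latent dimension."""
    return x @ W_mu, x @ W_log_var
```

The key point is the shape of the output: two vectors of length `latent_dim`, not a single encoded vector as in a standard autoencoder.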

Then, our decoder will generate a latent vector by sampling from these defined distributions and begin reconstructing the original input.

However, this sampling process requires special attention. When training the model, we use backpropagation to relate each network parameter to the final loss, but we cannot backpropagate through a random sampling operation. Fortunately, we can use a clever idea called the "reparameterization trick": we randomly sample ε from a unit Gaussian, scale it by the standard deviation σ of the latent distribution, and shift it by the mean μ, i.e. z = μ + σ · ε.

Through this reparameterization, we can now optimize the parameters of the distribution while still maintaining the ability to sample randomly from that distribution.

Note: to prevent the network from producing invalid negative values for σ, we usually have the network learn log σ² and exponentiate this value to recover the variance of the latent distribution; the exponential guarantees positivity.
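The reparameterization trick can be sketched in a few lines of NumPy; here `log_var` is the learned log σ² described in the note above:

```python
import numpy as np

rng = np.random.default_rng(42)

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives entirely in eps, which has no learnable
    parameters, so gradients can flow through mu and log_var
    during backpropagation."""
    eps = rng.standard_normal(np.shape(mu))
    sigma = np.exp(0.5 * np.asarray(log_var))  # log-variance -> std dev
    return mu + sigma * eps
```

With mu = 0 and log_var = 0, samples come from a unit Gaussian; shrinking log_var collapses the samples onto mu.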

Visualization of Latent Space

To understand the implications of the variational autoencoder model and how it differs from standard autoencoder architectures, it is worth examining the latent space. This blog post provides a good discussion of the topic, which I summarize in this section.

The main advantage of a variational autoencoder is that it lets us learn a smooth latent representation of the input data. A standard autoencoder simply learns an encoding that allows it to reproduce the input. As the leftmost figure shows, focusing solely on reconstruction loss does allow us to separate different classes (MNIST digits in this case), which lets the decoder reproduce the original handwritten digits. However, the data may be distributed unevenly in the latent space; in other words, some regions of the latent space do not represent any of the data we observed.

On the other hand, if we only care about the latent distribution being similar to the prior distribution (through our KL divergence loss term alone), we end up describing every observation with the same unit Gaussian, and subsequent sampling and visualization look like the middle figure above. In other words, we have failed to describe the original data.

However, when optimizing both terms simultaneously, we get latent states that describe the attributes while closely approximating the prior distribution, and a reconstruction error that stays relatively small.

When I build a variational autoencoder, I like to examine the latent dimensions of several samples in the data to see the characteristics of the distribution.

If we observe that the latent distributions are very narrow, we can give the KL divergence term a higher weight with a parameter β > 1, encouraging the network to learn broader distributions. This simple observation led to the development of a new class of models: disentangled variational autoencoders. It turns out that by placing greater emphasis on the KL divergence term, we also implicitly encourage the learned latent dimensions to be uncorrelated (through our simplifying assumption of a diagonal covariance matrix).
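The β-weighted objective described above can be sketched as follows (the function and argument names are illustrative):

```python
def beta_vae_objective(recon_err, kl, beta=4.0):
    # beta > 1 up-weights the KL term, pushing each q(z|x) toward the
    # unit-Gaussian prior and encouraging uncorrelated, disentangled
    # latent dimensions; beta = 1 recovers the plain VAE objective.
    return recon_err + beta * kl
```

The only change from the standard VAE loss is the scalar multiplier on the KL term, which makes this an easy knob to experiment with.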

Variational autoencoder as a generative model

By sampling from the latent space, we can use the decoder network to form a generative model capable of creating new data similar to what was observed during training. Specifically, we sample from the prior distribution, which we assume follows a unit Gaussian.

The figure below shows data generated by the decoder network of a variational autoencoder trained on the MNIST handwritten digit dataset. Here, we sampled a grid of values from a 2-D Gaussian and show the decoder network's output for each point.
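A sketch of this grid-sampling procedure, with a hypothetical stand-in decoder in place of the trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 15 x 15 grid of latent points covering the high-density region of a
# 2-D unit-Gaussian prior (roughly +/- 2 standard deviations per axis).
n = 15
axis = np.linspace(-2.0, 2.0, n)
grid = np.stack(np.meshgrid(axis, axis), axis=-1)  # shape (n, n, 2)

# Stand-in decoder (the real one is the trained network): a linear map
# followed by a sigmoid, producing a flat 784-pixel "image" in [0, 1].
W = rng.normal(scale=0.1, size=(2, 784))
def decode(z):
    return 1.0 / (1.0 + np.exp(-(z @ W)))

# Decode every grid point; in practice each row would be reshaped to
# 28 x 28 and tiled into the figure shown below.
images = decode(grid.reshape(-1, 2)).reshape(n, n, 784)
```

With a trained decoder, adjacent grid points produce smoothly varying digits, which is exactly the latent-space smoothness the KL term buys us.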

As you can see, each individual number exists in a different region of the latent space and smoothly transforms from one number to another. This smooth transformation can be very useful if you want to interpolate between two observation points.
