Variational Autoencoder (VAE): So that's how it works.

2026-04-06

Although I hadn't looked into it in detail before, I'd always had the impression that Variational Auto-Encoders (VAEs) were something good. Taking advantage of my recent burst of interest in probabilistic graphical models, I decided to try and understand VAEs as well.

So I went through a lot of information online, but without exception, I found them all to be very vague. The main feeling was that even after writing a long list of formulas, I was still confused. Finally, when I thought I understood it, I looked at the implementation code and realized that the implementation code was completely different from the theory.

Finally, after piecing together various ideas, adding my recent accumulation of knowledge on probabilistic models, and repeatedly comparing them with the original paper Auto-Encoding Variational Bayes, I think I've figured it out.

The real VAE is quite different from what many tutorials describe. Many tutorials are lengthy but fail to explain the key points of the model. So I wrote this article, hoping to provide a basic and clear explanation of VAEs.

Distribution transformation

We often compare VAEs with GANs. Indeed, their goals are basically the same—to build a model that generates target data X from latent variables Z—but their implementations differ.

More precisely, both assume that the latent variable Z follows some common distribution (such as a normal or uniform distribution), and then hope to train a model X = g(Z) that maps this original distribution onto the distribution of the training set. In other words, their purpose is to transform one distribution into another.

The challenge of generative models is determining the similarity between the generated distribution and the true distribution, because we only know the sampling results of both, but not their distribution expressions.

Now, assuming Z follows a standard normal distribution, we can sample several Z1, Z2, ..., Zn from it and then transform them to get X̂1=g(Z1), X̂2=g(Z2), ..., X̂n=g(Zn). But how do we determine whether the distribution of the dataset constructed by g is the same as the distribution of our target dataset?

Some readers might ask, "Isn't there KL divergence?" Of course not, because KL divergence calculates the similarity between two probability distributions based on their expressions, but we currently do not know the expressions for their probability distributions.

We only have one set of data {X̂1,X̂2,…,X̂n} sampled from a constructed distribution, and another set of data {X1,X2,…,Xn} sampled from the real distribution (which is the training set we want to generate). We only have the samples themselves, not the distribution expression, and therefore no way to calculate the KL divergence.

Although we encountered difficulties, we still had to find a way to solve them. The idea behind GANs is very direct and straightforward: since there is no suitable metric, I might as well train this metric using a neural network as well.

And so, WGAN was born. For a detailed explanation, please refer to "The Art of Debating: From Zero to WGAN-GP". VAE, on the other hand, used a more refined and roundabout technique.

A Slow Walk Through VAE

In this section, we will first review how VAE is introduced in general tutorials, then explore what problems exist, and then naturally discover the true nature of VAE.

Classic Review

First, we have a batch of data samples {X1,…,Xn}, which are described by X. We want to obtain the distribution p(X) of X based on {X1,…,Xn}. If we can obtain it, then we can directly sample based on p(X) to obtain all possible X (including those other than {X1,…,Xn}). This is the ultimate ideal generative model.

Of course, this ideal is difficult to achieve directly, so we instead write the distribution as a marginal over the latent variable Z:

p(X) = ∑_Z p(X|Z) p(Z)

Here we won't distinguish between summation and integration; as long as the meaning is correct, that's fine. At this point, p(X|Z) describes a model that generates X from Z, and we assume Z follows a standard normal distribution, i.e., p(Z) = N(0, I). If this ideal is realized, then we can first sample a Z from the standard normal distribution, and then calculate an X based on Z, which is also a great generative model.
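The sampling pipeline this describes (first draw Z from the prior, then decode) can be sketched as follows. The generator g here is a hypothetical stand-in: a fixed affine map playing the role of a trained neural network, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(z):
    # Hypothetical generator: a fixed affine map stands in for a trained
    # neural network that transforms latents into data-space samples.
    return 2.0 * z + 1.0

# Step 1: sample latents from the standard normal prior p(Z) = N(0, I).
z = rng.standard_normal(size=(10000, 2))

# Step 2: push each latent through the generator to obtain samples of X.
x_hat = g(z)

print(x_hat.shape)  # (10000, 2)
```

Any deterministic g induces some distribution over X; the whole problem of the generative model is making that induced distribution match the data distribution.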

The next step involves using an autoencoder to reconstruct the model, ensuring no useful information is lost. This is followed by a series of derivations, ultimately leading to the model implementation. A schematic diagram of the framework is shown below:

▲Traditional understanding of VAE

Do you see the problem? If the model really worked as in this diagram, we wouldn't know whether a resampled Zk still corresponds to the original Xk, so directly minimizing D(X̂k,Xk)^2 (where D is some distance function) is unjustified. In fact, if you look at actual code, you'll find it isn't implemented this way at all.

In other words, many tutorials expound at great length, yet when it comes to writing code, the code doesn't follow the text at all, and nobody seems to see any contradiction in this.

The First Appearance of VAE

In fact, in the entire VAE model, we did not use the assumption that p(Z) (prior distribution) is normally distributed; instead, we used the assumption that p(Z|X) (posterior distribution) is normally distributed.

Specifically, given a real sample Xk, we assume that there exists a distribution p(Z|Xk) (a posterior distribution) that is exclusive to Xk, and we further assume that this distribution is an (independent, multivariate) normal distribution.

Why emphasize "exclusive"? Because we will train a generator X=g(Z) later, hoping to restore a Zk sampled from the distribution p(Z|Xk) to Xk.

If we assume that p(Z) is a normal distribution, and then sample a Z from p(Z), how do we know which real X this Z corresponds to? Now that p(Z|Xk) belongs exclusively to Xk, we have reason to say that the Z sampled from this distribution should be restored to Xk.

In fact, this point is specifically emphasized in the application section of the paper on Auto-Encoding Variational Bayes:

In this case, we can let the variational approximate posterior be a multivariate Gaussian with a diagonal covariance structure:

log q_φ(z|x^(i)) = log N(z; μ^(i), σ^2(i) I)

Equation (9) in the paper is the key to realizing the entire model, and I don't know why many tutorials don't highlight it when introducing VAE. Although the paper also mentions that p(Z) is a standard normal distribution, that is not actually essential.

To reiterate, at this point, each Xk is assigned a unique normal distribution, which facilitates the subsequent generator's reconstruction. However, this results in as many normal distributions as there are Xs. We know that a normal distribution has two sets of parameters: the mean μ and the variance σ^2 (in multivariate distributions, these are vectors).

How do I find the mean and variance of the normal distribution p(Z|Xk) that is specific to Xk? There doesn't seem to be a direct approach.

Okay then, I'll use a neural network to fit the result. This is the philosophy of the neural network era: we use neural networks to fit the difficult calculations. We've already experienced this once with WGAN, and now we're experiencing it again.

Therefore, we construct two neural networks, μk = f1(Xk) and log σk² = f2(Xk), to fit them. We fit log σk² rather than σk² directly because σk² is always non-negative and would need an activation function to enforce that constraint, whereas log σk² can take any sign and needs no such activation.

At this point, I know the mean and variance specific to Xk, and I know what its normal distribution looks like. Then, I sample a Zk from this specific distribution and pass it through a generator to get X̂k=g(Zk).
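A minimal numeric sketch of this step, with untrained random linear maps standing in for the two encoder networks f1 and f2 (both are hypothetical placeholders; in a real VAE they are trained neural nets, usually sharing most layers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two encoder networks: fixed random linear maps.
W_mu = rng.standard_normal((4, 2))
W_lv = rng.standard_normal((4, 2))

def f1(x):  # mean network:         mu_k = f1(X_k)
    return x @ W_mu

def f2(x):  # log-variance network: log sigma_k^2 = f2(X_k)
    return x @ W_lv

x_k = rng.standard_normal((1, 4))      # one data sample X_k
mu, log_var = f1(x_k), f2(x_k)
sigma = np.exp(0.5 * log_var)          # sigma = exp((log sigma^2) / 2)

# Sample a Z_k from the normal distribution exclusive to X_k.
z_k = rng.normal(mu, sigma)
print(z_k.shape)  # (1, 2)
```

Note that the variance network outputs log σ², so the code exponentiates half of it to recover σ, exactly as described above.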

Now we can safely minimize D(X̂k,Xk)^2, because Zk is sampled from the distribution specific to Xk, and this generator should restore the initial Xk. Thus, we can draw a schematic diagram of the VAE:

In fact, VAE constructs a unique normal distribution for each sample and then reconstructs it by sampling.

Distribution Standardization

Let's think about what the final result will be based on the training process shown in the diagram above.

First, we want to reconstruct X, which means minimizing D(X̂k,Xk)^2. However, this reconstruction process is affected by noise because Zk is resampled and not directly calculated by the encoder.

Obviously, noise increases the difficulty of reconstruction. However, the noise intensity (i.e., variance) is calculated by a neural network, so the final model will try its best to make the variance zero in order to reconstruct better.

If the variance is 0, there is no randomness: no matter how you sample, you get only one definite value (the mean), and fitting a single fixed value is certainly easier than fitting many random ones. The mean, meanwhile, is computed by another neural network.

To put it simply, the model will gradually degenerate into a regular AutoEncoder, and noise will no longer have any effect.

Wouldn't that be a waste of effort? What happened to the generative model we were promised?

Don't rush, don't rush. In fact, VAE also makes all p(Z|X) conform to the standard normal distribution, which prevents the noise from being zero and ensures that the model has generative capabilities.

How do we understand "guaranteeing generative capability"? If all p(Z|X) are very close to the standard normal distribution N(0,I), then by definition:

p(Z) = ∑_X p(Z|X) p(X) ≈ ∑_X N(0,I) p(X) = N(0,I) ∑_X p(X) = N(0,I)

This allows us to meet our prior hypothesis: p(Z) follows a standard normal distribution. We can then confidently sample from N(0,I) to generate images.

In order for the model to have generative capability, VAE requires each p(Z|X) to align with the standard normal distribution.

So how do we make all p(Z|X) align with N(0,I)? If there were no external knowledge, the most direct approach would be to add extra losses on top of the reconstruction error:

L_μ = ‖f1(Xk)‖²  and  L_σ² = ‖f2(Xk)‖²

Since f1 and f2 compute the mean μk and the log-variance log σk² respectively, achieving N(0,I) means pushing both toward zero. However, this raises the problem of how to weight these two losses against each other; if the ratio is chosen poorly, the generated images will be blurry.

Therefore, the original paper directly computes the KL divergence KL(N(μ,σ²)‖N(0,I)) between the general (component-wise independent) normal distribution and the standard normal distribution as this extra loss, and the result is:

KL(N(μ,σ²)‖N(0,I)) = (1/2) ∑_{i=1}^{d} ( μ_(i)² + σ_(i)² − log σ_(i)² − 1 )

Here, d is the dimension of the latent variable Z, and μ(i) and σ_{(i)}^{2} represent the i-th components of the mean vector and variance vector of the general normal distribution, respectively. Using this formula directly as the supplementary loss eliminates the need to consider the relative proportions of the mean loss and variance loss.
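This closed form is straightforward to compute; a minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(sigma^2)) || N(0, I) )
    #   = 0.5 * sum_i ( mu_i^2 + sigma_i^2 - log sigma_i^2 - 1 )
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)

# The loss vanishes exactly when mu = 0 and sigma^2 = 1 ...
print(kl_to_standard_normal(np.zeros(3), np.zeros(3)))  # 0.0
# ... and is positive otherwise.
print(kl_to_standard_normal(np.ones(3), np.zeros(3)) > 0)  # True
```

Passing log σ² rather than σ² matches the encoder parameterization chosen earlier.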

Clearly, this loss can also be understood as two parts added together:

L_μ = (1/2) ∑_{i=1}^{d} μ_(i)²  and  L_σ² = (1/2) ∑_{i=1}^{d} ( σ_(i)² − log σ_(i)² − 1 )

Derivation

Since we are considering a multivariate normal distribution with independent components, we only need to derive the univariate case. By definition:

KL(N(μ,σ²)‖N(0,1)) = ∫ N(x; μ,σ²) log [ N(x; μ,σ²) / N(x; 0,1) ] dx
= (1/2) ∫ N(x; μ,σ²) [ −log σ² + x² − (x−μ)²/σ² ] dx

The entire result consists of three integrals (times 1/2). The first term is the integral of −log σ² against the probability density (which integrates to 1), giving −log σ². The second term is the second moment of the normal distribution; those familiar with it will know that the second moment of N(μ,σ²) is μ² + σ². And by definition, the third term is "minus variance divided by variance = −1". Therefore, the final result is:

KL(N(μ,σ²)‖N(0,1)) = (1/2) ( −log σ² + μ² + σ² − 1 )
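One can sanity-check this closed form numerically by integrating the definition on a grid (the values of μ and σ² below are arbitrary test values, not anything from the model):

```python
import numpy as np

mu, var = 0.7, 1.8  # arbitrary test values for mean and variance

# Closed form: 0.5 * ( -log sigma^2 + mu^2 + sigma^2 - 1 )
closed = 0.5 * (-np.log(var) + mu ** 2 + var - 1.0)

# Numerical integral of N(x; mu, var) * log[ N(x; mu, var) / N(x; 0, 1) ]
x = np.linspace(-10.0, 10.0, 200001)
p = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
q = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
numeric = np.sum(p * np.log(p / q)) * (x[1] - x[0])

print(abs(closed - numeric) < 1e-4)  # True
```

The two values agree to well within the grid's discretization error, confirming the derivation.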

Reparameterization Trick

Finally, here is a technique used in implementing the model, known as the reparameterization trick.

▲Re-parameter techniques

It's actually quite simple. We want to sample a Zk from p(Z|Xk). Although we know p(Z|Xk) is a normal distribution, its mean and variance are computed by the model, and we need this sampling step to pass gradients back to the mean and variance networks. However, the "sampling" operation itself is not differentiable, while the result of sampling can be. So we make use of the following fact:

Sampling a Z from N(μ,σ²) is equivalent to sampling an ε from N(0,I) and letting Z = μ + ε × σ.

Therefore, we change sampling from N(μ,σ²) into sampling from N(0,I), and then recover a sample from N(μ,σ²) through the parameter transformation Z = μ + ε × σ. In this way, the "sampling" operation no longer needs to participate in gradient descent; only the sampling result does, and the whole model becomes trainable.

To understand how to implement it, simply compare the above text with the code, and you'll immediately understand.
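As a bare-bones numeric illustration of the trick (scalar case, with arbitrary values standing in for the encoder's outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5  # pretend these come from the encoder networks

# Instead of drawing z ~ N(mu, sigma^2) directly (not differentiable
# w.r.t. mu and sigma), draw eps ~ N(0, 1) and transform it.
eps = rng.standard_normal(100000)
z = mu + sigma * eps  # mu and sigma enter via a differentiable expression

# The transformed samples follow N(mu, sigma^2), as their statistics show.
print(abs(z.mean() - mu) < 0.01, abs(z.std() - sigma) < 0.01)  # True True
```

In an actual framework the same line `z = mu + sigma * eps` is what lets autodiff propagate gradients through μ and σ while ε stays a constant sample.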

Further analysis

Even if we understand all of the above, we may still have many questions about VAE.

What is the essence?

What is the essence of VAE? Although VAE is also called a type of AE (AutoEncoder), its approach (or its interpretation of the network) is unique.

In VAE, there are two encoders: one computes the mean and the other the variance. This is already surprising: the encoder is used not to encode, but to compute a mean and a variance. That really is big news. And aren't the mean and variance statistics? How can they be computed by a neural network?

In fact, I think that VAE started with variational and Bayesian theories that are daunting to ordinary people, and finally landed on a specific model. Although it took a long way, the final model is actually very down-to-earth.

Essentially, on top of our regular autoencoder, VAE adds Gaussian noise to the encoder's output (the encoder here corresponds to the mean-computing network), making the decoder robust to noise. The extra KL loss (aiming for zero mean and unit variance) is effectively a regularization term on the encoder, pushing the encoder output toward zero mean.

What is the role of the other encoder (corresponding to the network that calculates variance)? It is used to dynamically adjust the intensity of noise.

Intuitively, when the decoder is not yet well trained (the reconstruction error is much greater than the KL loss), the noise will be reduced appropriately (the KL loss will increase), making it easier to fit (the reconstruction error will begin to decrease).

Conversely, if the decoder is trained well (reconstruction error is less than KL loss), noise will increase (KL loss decreases), making fitting more difficult (reconstruction error starts to increase again). In this case, the decoder needs to find ways to improve its generation ability.

▲The essential structure of VAE

Simply put, the reconstruction process wants no noise, while the KL loss wants Gaussian noise; the two are in opposition. So, like GAN, VAE actually contains an adversarial process internally, except that here the two sides are blended together and co-evolve.

From this perspective, VAE's approach seems more sophisticated, because in GANs, while the forger evolves, the detector remains unaffected, and vice versa. Of course, this is only one aspect and cannot prove that VAE is better than GAN.

The truly brilliant aspect of GAN is that it directly trains the metric, and this metric is often better than what we can imagine (however, GAN itself also has various problems, which I won't go into here).

Why the normal distribution?

Regarding the distribution of p(Z|X), readers may wonder: Is it necessary to choose a normal distribution? Can a uniform distribution be chosen?

First, this is essentially an experimental question; you can find the answer by trying both distributions. However, intuitively, the normal distribution is more reasonable than the uniform distribution because the normal distribution has two independent sets of parameters: mean and variance, while the uniform distribution only has one.

As we mentioned earlier, in VAE, reconstruction and noise are mutually antagonistic. Reconstruction error and noise intensity are two mutually antagonistic indicators. In principle, when changing the noise intensity, there should be the ability to keep the mean constant. Otherwise, it is difficult to determine whether the increase in reconstruction error is due to a change in the mean (the encoder's fault) or an increase in variance (the noise's fault).

Since a uniform distribution cannot change the variance while keeping the mean constant, a normal distribution should be more reasonable.

Where are the variations?

Another interesting (though not very important) question: VAE is called a "Variational Autoencoder", so what is its connection with variational methods? In VAE papers and related explanations, variational methods hardly seem to be mentioned at all.

If the reader has already accepted the KL divergence, then VAE seems to have little to do with variational methods, because the KL divergence is defined as:

KL(p(x)‖q(x)) = ∫ p(x) log [ p(x)/q(x) ] dx

If the distributions are discrete, the integral becomes a summation. We need to prove that, for a fixed probability distribution p(x) (or a fixed q(x)) and any probability distribution q(x) (or p(x)), we have KL(p(x)‖q(x)) ≥ 0, with equality only when p(x) = q(x).

Since KL(p(x)‖q(x)) is actually a functional, finding the extremum of a functional requires the use of variational methods. Of course, the variational methods here are just a parallel generalization of ordinary calculus and do not yet involve truly complex variational methods. The variational lower bound of VAE is directly obtained based on the KL divergence. Therefore, if we directly acknowledge the KL divergence, variational methods are irrelevant.

In short, the name VAE contains "variational" because its derivation process uses the KL divergence and its properties.

Conditional VAE

Finally, since the current VAE is unsupervised training, it is natural to think: if there is labeled data, can the label information be added to help generate samples?

The intention behind this problem is often to control a certain variable in order to generate a specific class of image. This is certainly possible, and we call such a model a Conditional VAE, or CVAE (correspondingly, in the GAN world there is also CGAN).

However, CVAE is not one specific model but a class of models: there are many ways to incorporate label information into a VAE, each serving a different purpose. Building on the previous discussion, we present a very simple CVAE here.

▲A simple CVAE structure

In the preceding discussion, we hoped that after X is encoded, the distribution of Z would have zero mean and unit variance. This "hope" is achieved by adding KL loss.

If we now have additional category information Y, we can hope that samples of the same class have a unique mean μ^Y (with the variance remaining constant, still unit variance), and let the model train itself to obtain this μ^Y.

In this way, there will be as many normal distributions as there are classes, and during generation, we can control the class of the generated image by controlling the mean.

In fact, this might be the way to implement a CVAE with the least code on top of a VAE, because this "new hope" can be achieved simply by modifying the KL loss:

L_{μ,σ²} = (1/2) ∑_{i=1}^{d} ( (μ_(i) − μ^Y_(i))² + σ_(i)² − log σ_(i)² − 1 )
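A sketch of this modified loss (the function name is mine; μ^Y would be a trainable per-class mean vector in the real model):

```python
import numpy as np

def cvae_kl(mu, log_var, mu_y):
    # KL( N(mu, diag(sigma^2)) || N(mu_y, I) ): the ordinary VAE KL loss
    # with the posterior mean pulled toward the class mean mu_y instead of 0.
    return 0.5 * np.sum((mu - mu_y) ** 2 + np.exp(log_var) - log_var - 1.0)

# The loss vanishes when a sample's mean sits exactly at its class mean
# with unit variance, mirroring mu = 0 in the unconditional VAE.
print(cvae_kl(np.ones(2), np.zeros(2), np.ones(2)))  # 0.0
```

Setting mu_y to zero recovers the original VAE loss, which is why this change requires so little extra code.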

The image below shows that this simple CVAE has some effect, but because the encoder and decoder are relatively simple (pure MLP), the effect of controlling the generation is not perfect.

Using this CVAE to control the generation of the number 9, we can see that it generates various styles of 9, and gradually transitions to 7, so the initial observation shows that this CVAE is effective.

Readers are encouraged to study more complete CVAE models on their own. Recently, a work combining CVAE and GAN, CVAE-GAN: Fine-Grained Image Generation through Asymmetric Training, has also been released, with a wide variety of model approaches.
