
Variational Autoencoder (VAE) Tutorial

2026-04-06 05:43:20

1. Mysterious Variables and Datasets

We now have a dataset DX, where each of its elements is called a data point.

We assume this sample is controlled by some mysterious forces, but we have no way of observing them directly. Let's assume there are n such forces, named power1, power2, ..., powern, with strengths z1, z2, ..., zn respectively, which we call mysterious variables. We represent them as a vector z = (z1, z2, ..., zn).

We also give this vector z a name: the mysterious combination.

In short: mysterious variables represent mysterious combinations of mysterious forces.

In more formal terms, a latent variable represents the combination of latent factors.

Here we clarify the concept of a home space. Suppose the dataset DX has m points; these m points belong to some space. For example, in the one-dimensional case, if each point is a real number, then its home space is the set of real numbers. So we define the space to which each point of DX belongs as XS. It will no longer feel unfamiliar when we mention it later.

Similarly, the mysterious variable z has a home space of its own, called ZS.

Next, we will formally construct the mysterious relationship between X and Z. This relationship is the mysterious force we mentioned earlier. Intuitively, if our dataset is completely controlled by these n mysterious variables, then for each point in DX there should be some mysterious combination zj of the n mysterious variables that mysteriously determines it.

Next, we'll simplify this relationship further. Let's acknowledge that these n mysterious variables aren't the only factors controlling DX; there are other mysterious forces at play, which we'll ignore for now. We can use probability to compensate for this deficiency. Why? For example, suppose we've built a machine that fires bullets at a fixed target. We've precisely calculated the impact force and angle, but due to uncontrollable factors like air currents or the Earth's rotation, the target may not be hit exactly. These factors may be numerous, but they aren't the primary factors forming DX. By the central limit theorem, the combined influence of all these factors can be represented by a Gaussian probability density function:

f(x) = 1 / (σ√(2π)) · exp( −(x − μ)² / (2σ²) )

When μ = 0, it becomes:

f(x) = 1 / (σ√(2π)) · exp( −x² / (2σ²) )

This is the formula for a one-dimensional Gaussian distribution. The multi-dimensional version is more complex:

f(x) = 1 / ( (2π)^(d/2) √(det Σ) ) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )
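As a quick sanity check, the one-dimensional density can be evaluated directly; a minimal sketch in plain Python:

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    # 1-D Gaussian density: exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(round(gaussian_pdf(0.0), 4))  # 0.3989 -- the peak of the standard normal
```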

In any case, just remember that we are not able to focus on all the mysterious variables right now. We only care about a few factors that may be important. There are various assumptions about the distribution of these factors. We will discuss their probability distribution later. For now, we assume that we also know nothing about their specific distribution. We only know that they are in the ZS space.

We previously discussed the mysterious combination. If every point in a dataset X shares exactly the same mysterious combination, it is a single-class dataset. If there are several distinct combinations, it is a multi-class dataset. And if the combinations vary continuously, it becomes a complex dataset with blurred boundaries.

For example, suppose our dataset is a set of line segments, and segment length is the only mysterious variable. As long as the length varies continuously within some range, the segments will appear evenly distributed, making them almost impossible to separate into classes. But if the length can only be 1, 3, or 5, then the dataset clusters into three groups. If the segments are generated entirely by a computer, each group overlaps perfectly; if they are drawn by humans, errors prevent perfect overlap. These non-overlapping parts are the other complicating factors we mentioned, which we usually model with a Gaussian distribution. With this basic understanding, let's give the mysterious combination a formal description.

Suppose there are two variables, z∈ZS and x∈XS, and there exists a family of deterministic functions f(z;θ), where each function in the family is uniquely determined by θ∈Θ, f:ZS×Θ→XS. When θ is fixed and z is a random variable (with a probability density function of Pz(z)), then f(z;θ) is a random variable x defined on XS, and the corresponding probability density function can be written as g(x).

Our goal is to optimize θ to find an f such that the sampling of random variable x is very similar to X. It's important to note here that x is a variable, DX is an existing dataset, and x does not belong to DX; I've deliberately chosen names that distinguish them.

Thus, f is the channel of the mysterious forces: it transforms their strengths into the variable x, and this random variable x is directly related to the dataset DX.

Let the dataset be DX, and the probability of observing it be P(DX). By the law of total probability, we have formula (1):

P(DX) = ∫ P(DX | z; θ) P(z) dz        (1)

Here we replace f with a probability density function. We know from before that f maps z to x, and x is directly related to DX. This direct relationship can be expressed as x = f(z; θ); with the extra noise factors folded in, we replace the deterministic output f(z; θ) with the conditional density P(X | z; θ).

In this way, P(X | z; θ) directly represents the relationship between z and DX.

Okay, so formula (1) is the mysterious relationship between our mysterious forces and the observed dataset. Put simply, it says that when the hidden variables follow a certain pattern, it becomes very easy to generate the dataset we see. What we need to do, assuming there are n mysterious forces, is find a magical function f that transforms changes in the mysterious forces into changes in the magical x, and thereby easily generates the dataset DX.

From the above description, f is the generative transformation function. Formula (1) does not represent the transformation itself; maximizing it is the maximum likelihood estimate of this relationship: it finds the function f most likely to have generated the dataset DX.

Next, let's return to this probability density function. As we mentioned earlier, if z captured all the mysterious forces, then the x it produces would be fixed: a fixed z would yield a fixed x. In reality, many other factors exist, so x also depends on them, and their combined influence is ultimately Gaussian. Therefore, we boldly assume a Gaussian probability density function, i.e.

P(X | z; θ) = N( X | f(z; θ), σ² · I )
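To make formula (1) concrete with this Gaussian observation model, here is a Monte Carlo sketch. The generator f (the identity), the noise level sigma = 0.5, and the evaluation point 0.7 are all hypothetical choices, picked so the marginal has a known closed form to compare against:

```python
import math
import random

random.seed(0)

sigma = 0.5  # assumed observation noise

def f(z):
    # Hypothetical generator: the identity, so x | z ~ N(z, sigma^2).
    return z

def normal_pdf(x, mu, s):
    return math.exp(-(x - mu) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def p_x_monte_carlo(x, n=200_000):
    # Formula (1): P(x) = integral of P(x|z) P(z) dz,
    # approximated by averaging N(x | f(z), sigma^2) over z ~ N(0, 1).
    total = 0.0
    for _ in range(n):
        z = random.gauss(0.0, 1.0)
        total += normal_pdf(x, f(z), sigma)
    return total / n

# With f(z) = z the marginal has a known closed form: N(0, 1 + sigma^2).
exact = normal_pdf(0.7, 0.0, math.sqrt(1.0 + sigma ** 2))
print(abs(p_x_monte_carlo(0.7) - exact) < 0.01)  # True
```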

Note that the distribution of z is still unknown to us.

Suppose we knew that z currently takes one or a few specific values. Then we could use gradient descent to find a θ that maximizes the probability of z generating the desired dataset DX. Generalizing, we face the problem of z ranging over many values, or even a whole region; we'll tackle that tricky problem later. For now, just note that, in principle, we can find the optimal solution by learning the parameter θ.
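As a toy illustration of learning θ by gradient methods, assume (purely for illustration) the model x | θ ~ N(θ, 1) and a made-up three-point dataset; gradient ascent on the log-likelihood recovers the sample mean:

```python
# Toy maximum-likelihood fit by gradient ascent. Illustrative assumptions:
# the model is x | theta ~ N(theta, 1), and DX is a made-up three-point dataset.
DX = [1.0, 2.0, 3.0]

theta, lr = 0.0, 0.1
for _ in range(200):
    # Gradient of the log-likelihood:
    # d/dtheta sum_i log N(x_i | theta, 1) = sum_i (x_i - theta)
    grad = sum(x - theta for x in DX)
    theta += lr * grad

print(round(theta, 4))  # 2.0 -- the sample mean, the MLE for a Gaussian mean
```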

One more key assumption: we take it for granted that f exists, i.e., that the relationship between the observed variable and the mysterious variable can be represented by a function.

2. Variational Autoencoder (VAE)

In this section, we explore how to maximize formula (1). First, we need to discuss how to determine the mysterious variable z, that is, how many dimensions z should have, and what the scope of each dimension is. More meticulously, we might even need to investigate what each dimension represents, whether they are independent of each other, and what the probability distribution of each dimension is.

If we continue down this path, we will get bogged down. We can cleverly avoid these problems by keeping them "mysterious"!

We don't care what each dimension represents; we only assume that a group of independent variables exists. Returning to our earlier discussion: although we don't know the exact number of dimensions, we can assume there are n main factors, and n can be chosen generously. For example, if there are really 4 main factors but we assume 10, then after training, 6 of them may stay near 0 almost all the time. The final issue, which is the trickiest and requires detailed discussion, is the probability distribution and the values of z.

Since we don't know the distribution of z, can we instead find a new set of mysterious variables w that follow a standard normal distribution N(0, I), where I is the identity matrix, and transform w into z through sufficiently complex functions? Neural networks make this feasible. Call these complex functions h1, h2, ..., hn, so that z1 = h1(w1), ..., zn = hn(wn). We need not worry about the specific distribution or range of z; a neural network will handle it. If f(z; θ) is a multi-layer network, its first few layers can transform the standard normal w into the actual latent variable z, while the later layers map z to x. Since w and z are in one-to-one correspondence, w is, in a sense, also a mysterious force, and the problem becomes a relationship between w and x. So let's simply call w by the name z again, and forget the mysterious variable we had before.
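The idea that a deterministic map can reshape a standard normal into a differently distributed variable can be sketched in a few lines. The map h(w) = exp(w), which turns w into a log-normal z, is just an illustrative choice:

```python
import math
import random
import statistics

random.seed(0)

def h(w):
    # Illustrative transform: exp turns a standard normal w
    # into a log-normally distributed latent z.
    return math.exp(w)

ws = [random.gauss(0.0, 1.0) for _ in range(100_000)]
zs = [h(w) for w in ws]
print(round(statistics.mean(zs), 2))  # close to e^0.5 ~ 1.65, the log-normal mean
```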

Alright, an even more magnificent journey is about to begin, please take your seats.

We already have formula (1):

P(X) = ∫ P(X | z; θ) P(z) dz        (1)

Now we can focus on attacking f. Since f is a neural network, we can use gradient descent. But there is another key question: how do we know whether the samples generated by f are similar to DX? If we can't answer this, we don't even know what our objective function is.

3. Define the objective function

Let's first define a function Q(z|DX), the probability density function of z given that the dataset DX has occurred. For example, for an image of the digit 0, z should represent "0" with high probability, while the probability of z representing "1" should be very low. If we can obtain such a Q, we can use DX directly to compute good values of z. Why introduce Q? The reason is simple: DX is generated from the variable x, so to model x we would need to introduce a probability density function T(x|DX); that is, for DX we want an optimal probability density over x. Q(z|DX) is the analogous object one level deeper, over the latent z.

The problem now becomes how to compute Q(z|DX) from DX and make it as close as possible to the ideal posterior P(z|DX). This calls for a more advanced tool: relative entropy, also known as KL divergence (denoted by D).

We won't bother subscripting P here. It's perfectly fine to write Pz(z) simply as P(z); the subscripts above were only to distinguish concepts, and the argument in parentheses already conveys which distribution is meant.

Expanding D[Q(z|DX) ‖ P(z|DX)] and applying Bayes' rule to P(z|DX) gives terms in log P(DX|z), log P(z), and log P(DX). Since log P(DX) does not depend on z, it can be pulled out of the expectation, yielding the brilliant formula (2):

log P(DX) − D[Q(z|DX) ‖ P(z|DX)] = E_{z∼Q(z|DX)}[ log P(DX|z) ] − D[Q(z|DX) ‖ P(z)]        (2)

Formula (2) is the core formula of VAE. We will analyze this formula next.

The left side of the equation contains our optimization objective P(DX), along with an error term. This error term is the relative entropy between our approximation Q(z|DX) and the true posterior P(z|DX) given DX. When Q matches the posterior perfectly, this error term is 0. The right side of the equation is what we can optimize using gradient descent. Here, Q(z|DX) acts like an encoder from DX to z, and P(DX|z) like a decoder from z back to DX. This is why the VAE architecture is called an autoencoder.

Since there is no longer any ambiguity about DX, we replace DX with X from here on.

We now have formula (2) rewritten with X:

log P(X) − D[Q(z|X) ‖ P(z|X)] = E_{z∼Q(z|X)}[ log P(X|z) ] − D[Q(z|X) ‖ P(z)]

Let's clarify the meaning of each probability:

P(X) represents the probability of observing a point of the current dataset, but we don't know its distribution. For example, suppose X lives in a one-dimensional finite space taking only the integer values 0 to 9, and our dataset is {0, 1, 2, 3, 4}. If the distribution is uniform, the probability of landing in the dataset is 0.5; under another distribution it could be 0.1, 0.01, or something else entirely. P(X) is a function, much like a person who, when you ask about a certain value, tells you the probability of X taking it.

P(z) – this z is the w we introduced above, remember? It has been normalized to the normal distribution. If z is one-dimensional, then P(z) is the standard normal distribution N(0, 1).

P(X|z) – given a value of z, this function tells us the probability that X takes a certain value. For example, suppose z is a magical variable controlling the screen's red color and its intensity, with z following N(0, 1): z = 0 means pure red, and the further z deviates from 0, the deeper the red. P(X|z) is then the probability of each value of X once z is fixed. Since a computer is precisely controlled with no extra random factors, if z = 0 yields the fixed color value 0xFF0000, then P(X = 0xFF0000 | z = 0) = 1. In the messier real world, however, other random factors add noise around the base value of X determined by z; this is the central-limit effect we discussed earlier. f(z) is the direct representation of the relationship between X and z.

P(z|X) – the probability of z given that X has occurred. In the computer-screen example this is simple. In general, just as we simplified X|z to a Gaussian, we will likewise treat z|X as Gaussian. The same explanation applies to Q below.

Q(z) – the analysis of Q mirrors that of P, except that P is the ideal distribution, the true force behind the composition of X, while Q is our imitation: a counterfeit that hopes to control the generation of X in the real world through neural networks. When Q behaves exactly like the ideal P, it is a superior counterfeit, even deserving the authentic label. Since our P(z) has been simplified to N(0, I), Q can only try to approach N(0, I).

Q(z|X) – a probability density learned from the actual relationship between X and z. It represents the distribution of z when X occurs, as realized by our model.

Q(X|z) – under our model, the probability that X takes a given value when z is given.

Our goal is to maximize P(X), but we don't know its distribution, so we can't optimize it directly: we have no prior knowledge of it. Hence formula (2). The second term on the left is the relative entropy between P(z|X) and Q(z|X); it demands that our distribution of z given X be close to the true posterior. The whole left side is therefore our optimization target, and the larger it is, the better, so the right side should also be made as large as possible.

The first term on the right is the expected log-probability of X under z drawn from the actual distribution Q(z|X), which is determined by the mapping from X to z. It corresponds to reconstructing X from z.

The second term on the right pushes the z encoded from X to be as close as possible to the real z. Since P(z) is fixed at N(0, I) and Q(z|X) is also assumed normal, the goal is to make Q(z|X) approach the standard normal distribution.

Now we have a deeper understanding of this formula. Next, we will proceed with its implementation.

4. Implementation

We now implement each of the two terms on the right side.

The second term is the relative entropy between Q(z|X) and N(0,I), where X->z constitutes the encoder part.

Q(z|X) is a normal distribution. The KL divergence between two multivariate normal distributions N(μ0, Σ0) and N(μ1, Σ1) is:

D[ N(μ0, Σ0) ‖ N(μ1, Σ1) ] = ½ ( tr(Σ1⁻¹ Σ0) + (μ1 − μ0)ᵀ Σ1⁻¹ (μ1 − μ0) − d + ln( det Σ1 / det Σ0 ) )

Here det is the determinant, tr is the trace of a matrix, and d is the dimensionality of z, i.e., d = tr(I).

To transform this into concrete neural-network and matrix operations, we specialize it: take N(μ0, Σ0) = N(μ(X), Σ(X)) with diagonal Σ, and N(μ1, Σ1) = N(0, I). The formula simplifies to:

D[ N(μ, Σ) ‖ N(0, I) ] = ½ ( tr(Σ) + μᵀμ − d − ln det Σ ) = ½ Σ_i ( σ_i² + μ_i² − 1 − ln σ_i² )
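For diagonal covariances this closed form is a one-liner; a minimal sketch, parameterizing the variance by its log as VAE encoders commonly do:

```python
import math

def kl_diag_gaussian_vs_std_normal(mu, log_var):
    # D[ N(mu, diag(exp(log_var))) || N(0, I) ]
    #   = 0.5 * sum_i ( sigma_i^2 + mu_i^2 - 1 - log sigma_i^2 )
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

print(kl_diag_gaussian_vs_std_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0 -- identical distributions
```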

OK, we've worked out the KL term. Next comes the encoder network: a neural network maps X to both μ(X) and Σ(X).

The first term says that the data reconstructed from z should be as similar as possible to X. The reconstruction z -> X constitutes the decoder part, and its key is the function f, which for us means building a decoder neural network.

At this point, all the details of the implementation are shown in the following diagram.

Because one part of this network involves random sampling, gradients cannot propagate backward through it. Clever predecessors therefore restructured it with the reparameterization trick: instead of sampling z ∼ N(μ, σ²) directly, sample ε ∼ N(0, I) and set z = μ + σ ⊙ ε.

This allows for backpropagation training of the entire network.
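The trick itself is tiny; a sketch with made-up values μ = 2.0 and σ = 0.5:

```python
import random
import statistics

random.seed(1)

def reparameterize(mu, sigma, n=100_000):
    # z = mu + sigma * eps with eps ~ N(0, 1): the randomness lives in eps,
    # so gradients can flow through mu and sigma during backpropagation.
    return [mu + sigma * random.gauss(0.0, 1.0) for _ in range(n)]

samples = reparameterize(2.0, 0.5)
print(round(statistics.mean(samples), 1), round(statistics.stdev(samples), 1))  # 2.0 0.5
```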

The specific implementation code is here:

https://github.com/vaxin/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/variational_autoencoder.py

Each step within is accompanied by a corresponding explanation in this article.
