
How to handle unlabeled data? A comprehensive guide to Variational Autoencoders (VAEs).


We encounter all sorts of data in experiments, but what happens when that data is unlabeled? Most deep learning techniques require clean, labeled data, which is not always realistic. If you have a set of inputs and their corresponding target labels, you can try to learn the probability of a particular label given a particular input. In practice, though, mappings are rarely that tidy. In this article, I explore Variational Autoencoders (VAEs) as a way to make sense of unlabeled data. After being trained on a collection of unlabeled images, the model can generate novel images of its own.

An autoencoder encodes input data into a compact hidden representation and then uses that representation to reconstruct an output resembling the original input. It is essentially data-specific compression, meaning it can only compress data similar to what it was trained on. Autoencoders are also lossy, so the decompressed output will be of slightly lower quality than the original input. If they cause quality loss, why are they so useful? Good question. It turns out they are very effective for data denoising: we train an autoencoder to reconstruct the input from a corrupted version of itself, so it learns to remove similar corruption from new data.

First, let's talk about Bayesian inference. Anyone reading this article is probably familiar with deep learning and its effectiveness at approximating complex functions; Bayesian inference, however, offers a principled framework for reasoning about uncertainty, expressing everything in terms of probabilities. This makes intuitive sense: at any given time there is evidence for or against what we believe, and when we learn something new we should combine the new evidence with what we already know to form updated probabilities. Bayes' theorem describes this process mathematically.

VAEs are the product of these ideas. From a Bayesian perspective, we can treat the input, the hidden representation, and the reconstructed output of a VAE as probabilistic random variables in a directed graphical model. Suppose we have a probabilistic model with data x and latent/hidden variables z. We can write the model's joint probability as:

p(x, z) = p(x | z) p(z)

Given a character generated by the model, we don't know which settings of the latent variables produced it; our model is inherently stochastic! In fact, computing the true posterior p(z | x) exactly is intractable, which is why a VAE learns an approximation to it with its encoder.

The VAE consists of three main parts:

• Encoder

• Decoder

• Loss function

Given an input x, say a 28×28 image of a handwritten digit, we have 784 dimensions, one per pixel. The encoder maps this into a latent/hidden representation space of much lower dimension than 784. We can then sample from a Gaussian probability density to obtain a noisy value of the representation.

Isn't that cool? Let's represent this with code.

First, we import the libraries and define our hyperparameters.
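As a minimal sketch, the hyperparameters might look like the following. The exact values here are illustrative assumptions, not taken from the article:

```python
# Hyperparameters for a VAE on 28x28 MNIST digits.
# All values below are assumptions chosen for illustration.
batch_size = 100        # images per gradient update
original_dim = 28 * 28  # each flattened image has 784 pixel dimensions
intermediate_dim = 256  # width of the hidden ReLU layer
latent_dim = 2          # dimensionality of the latent representation z
epochs = 50             # number of passes over the training set
epsilon_std = 1.0       # standard deviation of the sampling noise
```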

Next, we initialize the encoder network, which maps the input to the parameters of the hidden distribution. We take the input and pass it through a dense, fully connected layer with a ReLU activation (a classic non-linearity). We then map the result to two parameters in the latent space, z_mean and z_log_sigma, using two dense, fully connected layers of a predefined size.
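A minimal Keras-style sketch of this encoder, assuming the layer sizes above (the names z_mean and z_log_sigma come from the article; everything else is an assumption):

```python
import numpy as np
from tensorflow.keras import Model, layers

original_dim, intermediate_dim, latent_dim = 784, 256, 2

# Encoder: 784-dim input -> hidden ReLU layer -> Gaussian parameters.
x_in = layers.Input(shape=(original_dim,))
h = layers.Dense(intermediate_dim, activation='relu')(x_in)
z_mean = layers.Dense(latent_dim)(h)       # mean of the latent Gaussian
z_log_sigma = layers.Dense(latent_dim)(h)  # log std of the latent Gaussian

encoder = Model(x_in, [z_mean, z_log_sigma])
```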

The decoder takes z as its input and outputs the parameters of a probability distribution over the data. We assume each pixel is either 1 or 0 (black or white), so we can use a Bernoulli distribution, which models a binary outcome, to represent each pixel. The decoder therefore receives the latent/hidden representation of a digit as input and outputs 784 Bernoulli parameters, one per pixel, each a value between 0 and 1.

We will use z_mean and z_log_sigma to randomly sample new, similar points from the latent normal distribution by defining a sampling function, in which epsilon is a random normal tensor.
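The arithmetic of the sampling step can be sketched in plain NumPy (in a Keras model this would live inside a Lambda layer; the function name and shapes here are illustrative assumptions):

```python
import numpy as np

def sampling(z_mean, z_log_sigma, epsilon_std=1.0):
    """Draw z = mean + exp(log_sigma) * epsilon, with epsilon ~ N(0, I)."""
    epsilon = np.random.normal(0.0, epsilon_std, size=z_mean.shape)
    return z_mean + np.exp(z_log_sigma) * epsilon

# One point in a 2-D latent space: mean 0, log sigma 0 (i.e. sigma = 1).
z_mean = np.zeros((1, 2))
z_log_sigma = np.zeros((1, 2))
z = sampling(z_mean, z_log_sigma)
```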

Once we have z, we can feed it to the decoder, which maps these latent-space points back to the original input dimensions. To build the decoder, we initialize it with two fully connected layers and their respective activation functions. Because the data is expanded from a small dimension back to a much larger one, some information is lost in the reconstruction.
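A sketch of such a decoder, again assuming the layer sizes used earlier; the sigmoid output layer produces the 784 Bernoulli parameters, one value in [0, 1] per pixel:

```python
import numpy as np
from tensorflow.keras import Model, layers

latent_dim, intermediate_dim, original_dim = 2, 256, 784

# Decoder: latent point -> hidden ReLU layer -> 784 Bernoulli parameters.
z_in = layers.Input(shape=(latent_dim,))
h_decoded = layers.Dense(intermediate_dim, activation='relu')(z_in)
x_decoded_mean = layers.Dense(original_dim, activation='sigmoid')(h_decoded)

decoder = Model(z_in, x_decoded_mean)
```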

Pretty cool, right? But how much exactly is "some"? To measure it precisely, we build a loss function. The first term measures the reconstruction loss: if the decoder reconstructs the data poorly, this term incurs a large cost. The second term is a regularization term, which keeps the representations of similar inputs consistent. For example, if two different people write the digit 3, the resulting representations could end up looking very different, simply because handwriting naturally varies. That would be undesirable, and the regularizer's job is to penalize it, ensuring that similar inputs end up with similar representations. We define the total loss as the sum of the reconstruction term and the KL-divergence regularization term.
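The two terms can be written out explicitly in NumPy. This is a sketch, not the article's code: the reconstruction term is the Bernoulli negative log-likelihood (binary cross-entropy), and the KL term is the closed-form divergence between N(mean, sigma²) and the standard normal prior N(0, I):

```python
import numpy as np

def vae_loss(x, x_decoded, z_mean, z_log_sigma):
    eps = 1e-7  # avoid log(0)
    # Reconstruction: binary cross-entropy between pixels and
    # the decoder's Bernoulli parameters.
    recon = -np.sum(x * np.log(x_decoded + eps)
                    + (1 - x) * np.log(1 - x_decoded + eps))
    # Regularizer: KL( N(mean, sigma^2) || N(0, I) ), using
    # sigma^2 = exp(2 * log_sigma).
    kl = -0.5 * np.sum(1 + 2 * z_log_sigma
                       - np.square(z_mean) - np.exp(2 * z_log_sigma))
    return recon + kl
```

Note that when z_mean = 0 and z_log_sigma = 0, the KL term vanishes, since the approximate posterior already equals the prior.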

Now we come to training. We typically use gradient descent to optimize the loss with respect to the encoder and decoder parameters. But how do we take derivatives through randomly sampled variables?

We've essentially built randomness into the model itself. Gradient descent normally expects that a fixed set of parameters and a given input always produce the same output, so the only source of randomness should be the input. How do we solve this? We reparameterize! We rewrite the sampling step so that the randomness is independent of the parameters.

We define z as a deterministic function of the parameters and inject the randomness through an auxiliary random variable. Instead of producing a vector of real values directly, the encoder produces vectors of means and standard deviations, and we can take derivatives of the function involving z with respect to those distribution parameters. We use rmsprop as the model's optimizer and vae_loss as its loss function.

We begin training by importing the MNIST dataset and feeding it to the model for a given number of training epochs and a given batch size.
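Putting the pieces together, here is one possible end-to-end training sketch. The article implies compiling with optimizer='rmsprop' and loss=vae_loss; I instead use an explicit tf.GradientTape loop (my own substitution, chosen because it works across TensorFlow/Keras versions), and the layer sizes, subset size, and epoch count are all assumptions:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model, layers

original_dim, intermediate_dim, latent_dim = 784, 256, 2

# Encoder: 784-dim input -> ReLU -> (z_mean, z_log_sigma).
x_in = layers.Input(shape=(original_dim,))
h = layers.Dense(intermediate_dim, activation='relu')(x_in)
z_mean = layers.Dense(latent_dim)(h)
z_log_sigma = layers.Dense(latent_dim)(h)
encoder = Model(x_in, [z_mean, z_log_sigma])

# Decoder: latent point -> ReLU -> 784 Bernoulli parameters.
z_in = layers.Input(shape=(latent_dim,))
h_dec = layers.Dense(intermediate_dim, activation='relu')(z_in)
x_out = layers.Dense(original_dim, activation='sigmoid')(h_dec)
decoder = Model(z_in, x_out)

optimizer = tf.keras.optimizers.RMSprop()

@tf.function
def train_step(batch):
    with tf.GradientTape() as tape:
        mu, log_sigma = encoder(batch)
        # Reparameterization: z = mu + exp(log_sigma) * epsilon.
        eps = tf.random.normal(tf.shape(mu))
        z = mu + tf.exp(log_sigma) * eps
        recon = decoder(z)
        # vae_loss = reconstruction term + KL regularizer.
        rec_loss = original_dim * tf.keras.losses.binary_crossentropy(
            batch, recon)
        kl = -0.5 * tf.reduce_sum(
            1 + 2 * log_sigma - tf.square(mu) - tf.exp(2 * log_sigma),
            axis=-1)
        loss = tf.reduce_mean(rec_loss + kl)
    variables = encoder.trainable_variables + decoder.trainable_variables
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss

# Load MNIST, scale pixels to [0, 1], flatten to 784-dim vectors.
(x_train, _), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, original_dim).astype('float32') / 255.

# Tiny run (1 epoch, 1000 images) so the sketch finishes quickly;
# increase both for real training.
dataset = tf.data.Dataset.from_tensor_slices(x_train[:1000]).batch(100)
for epoch in range(1):
    for batch in dataset:
        loss = train_step(batch)
```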

Below, we plot the learned representations on a two-dimensional plane. Each colored cluster corresponds to a digit, and clusters that lie close together correspond to digits with similar structure.
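Such a plot can be sketched as follows: encode each test digit to its 2-D z_mean and color the point by the digit's label. The encoder here is freshly initialized for the sake of a self-contained example; a meaningful plot of course requires the trained encoder:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model, layers

original_dim, latent_dim = 784, 2

# Untrained stand-in; in practice, use the trained encoder's z_mean head.
x_in = layers.Input(shape=(original_dim,))
h = layers.Dense(256, activation='relu')(x_in)
z_mean = layers.Dense(latent_dim)(h)
encoder = Model(x_in, z_mean)

(_, _), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_test = x_test.reshape(-1, original_dim).astype('float32') / 255.

z = encoder.predict(x_test, verbose=0)  # one 2-D point per test digit
plt.figure(figsize=(6, 6))
plt.scatter(z[:, 0], z[:, 1], c=y_test, cmap='viridis', s=2)
plt.colorbar()
plt.savefig('latent_space.png')
```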

[Figure: latent-space representation of the digits]

Another way to visualize the model is to generate digits by scanning the latent plane: sample latent points at regular intervals and decode each one into the corresponding digit, as shown below:

[Figure: digits generated by scanning the latent plane]
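The scan itself can be sketched like this: decode an evenly spaced grid of latent points and tile the resulting 28×28 digits into one large image. The decoder here is a freshly initialized stand-in, and the grid size and coordinate range are assumptions:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model, layers

latent_dim, original_dim = 2, 784

# Untrained stand-in; in practice, use the trained decoder.
z_in = layers.Input(shape=(latent_dim,))
h = layers.Dense(256, activation='relu')(z_in)
x_out = layers.Dense(original_dim, activation='sigmoid')(h)
generator = Model(z_in, x_out)

n, digit_size = 15, 28
coords = np.linspace(-3, 3, n)  # evenly spaced latent coordinates
grid = np.array([[xi, yi] for yi in coords for xi in coords])

digits = generator.predict(grid, verbose=0)  # shape (n*n, 784)
# Tile the decoded digits into an (n*28) x (n*28) image.
figure = (digits.reshape(n, n, digit_size, digit_size)
                .transpose(0, 2, 1, 3)
                .reshape(n * digit_size, n * digit_size))
```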

The results are quite striking!

This exercise boils down to three key points:

• Variational autoencoders allow us to generate data through unsupervised learning.

• VAE = Bayesian inference + deep learning.

• Reparameterization lets us backpropagate through the network: with the randomness made independent of the parameters, we can compute gradients.

The complete code can be found on GitHub: https://github.com/vvkv/Variational-Auto-Encoders/blob/master/Variational%2BAuto%2BEncoders.ipynb
