An autoencoder is a model whose goal is to make its output equal to its input (that is, to copy its input to its output), but only after passing it through a function that produces an intermediate value different from the input. For example, in the transformation x → y → z, x is the input, y is the intermediate value produced by the first transformation, and z is the output produced by transforming y. The autoencoder's goal is to make z equal to x while y differs from x. In this transformation, y must carry all the information in x, and that is what makes it so meaningful. What, specifically, is its significance? It improves the efficiency of data classification!
Before delving into the implementation principles of autoencoders, it's essential to understand that we're working with artificial neural networks. So, what is a neural network? Simply put, a neural network applies nonlinear transformations to the original signal layer by layer, as shown in the diagram below.
Neural networks are often used for classification: the goal is to approximate the function mapping the input layer x to the output layer y, where h denotes one or more hidden layers. We therefore define an objective function that measures the difference between the current output and the true result, and use it to progressively adjust the system's parameters (e.g., through gradient descent) so that the whole network fits the training data as closely as possible. If there are regularization constraints, the model should also stay as simple as possible (to prevent overfitting).
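To make the above concrete, here is a minimal sketch of that training loop: a one-hidden-layer network fit by gradient descent on a mean-squared-error objective with an L2 regularization term. The toy task (predicting the sum of the inputs) and all parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: learn y = sum of the 4 input features.
X = rng.normal(size=(200, 4))
y = X.sum(axis=1, keepdims=True)

# One hidden layer (tanh nonlinearity), randomly initialized.
W1 = rng.normal(scale=0.1, size=(4, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=(8, 1)); b2 = np.zeros(1)

lr, lam = 0.05, 1e-4   # learning rate and L2 regularization strength
losses = []
for _ in range(500):
    h = np.tanh(X @ W1 + b1)        # hidden-layer representation
    out = h @ W2 + b2               # network output
    err = out - y
    # Objective: data-fit term plus regularization penalty.
    losses.append((err ** 2).mean() + lam * ((W1 ** 2).sum() + (W2 ** 2).sum()))
    # Backpropagate the objective to get gradients for each parameter.
    g_out = 2 * err / len(X)
    gW2 = h.T @ g_out + 2 * lam * W2
    gb2 = g_out.sum(axis=0)
    g_h = (g_out @ W2.T) * (1 - h ** 2)     # tanh derivative
    gW1 = X.T @ g_h + 2 * lam * W1
    gb1 = g_h.sum(axis=0)
    # Gradient-descent update: nudge every parameter downhill.
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
```

After training, the objective value in `losses` should have decreased, showing the network progressively fitting the training data.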
Implementation principle of autoencoder
Given a neural network, suppose we require its output to be identical to its input; we then train it, adjusting its parameters to obtain the weights of each layer. In doing so we naturally obtain several different representations of the input (each layer is one representation), and these representations are the features. An autoencoder is thus a neural network that tries to reproduce its input signal as accurately as possible. To achieve this reproduction, it must capture the most important factors of the input data, finding the principal components that represent the original information.
1) Given unlabeled data, learn features using unsupervised learning:
As shown in the diagram above, we feed the input signal into an encoder, which produces a code; this code is a representation of the input. How do we know the code actually represents the input? We add a decoder and look at its output. If the decoder's output is very similar to the original input signal (ideally, identical to it), we have good reason to believe the code is reliable. We therefore adjust the parameters of the encoder and decoder to minimize the reconstruction error between output and input. At that point we have obtained the first representation of the input signal: the code. The decoder's output is no longer important, because the code already carries all the information of the input; in other words, the code now stands in for the input.
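The encoder/decoder loop above can be sketched in a few lines. This is a deliberately minimal linear autoencoder trained by gradient descent on the reconstruction error; the data (6-D points lying on a 2-D subspace, so a 2-unit code can capture them) and all hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 6-D inputs that actually live on a 2-D subspace.
basis = rng.normal(scale=0.7, size=(2, 6))
X = rng.normal(size=(200, 2)) @ basis

d, k = 6, 2                                   # input size, code size
W_enc = rng.normal(scale=0.1, size=(d, k)); b_enc = np.zeros(k)
W_dec = rng.normal(scale=0.1, size=(k, d)); b_dec = np.zeros(d)

lr = 0.05
losses = []
for _ in range(3000):
    code = X @ W_enc + b_enc                  # encoder: input -> code
    recon = code @ W_dec + b_dec              # decoder: code -> reconstruction
    err = recon - X                           # reconstruction error
    losses.append((err ** 2).mean())
    g = 2 * err / len(X)
    g_code = g @ W_dec.T                      # backprop through the decoder
    W_dec -= lr * (code.T @ g); b_dec -= lr * g.sum(axis=0)
    W_enc -= lr * (X.T @ g_code); b_enc -= lr * g_code.sum(axis=0)
```

As the reconstruction error falls, the 2-dimensional `code` comes to carry the information needed to rebuild the 6-dimensional input, which is exactly the sense in which the code "represents" the input.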
2) Features are generated through the encoder, and then the next layer is trained. This process is repeated layer by layer.
The second layer is then trained using the first layer's code as its input; as before, its output is made equal to its input, yielding code2 as a representation of code. The third layer yields code3 as a representation of code2, and so on, up to as many layers as desired. The "deep" in deep learning comes from this stack of layers: the more layers, the deeper the learning.
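The layer-by-layer procedure can be sketched as follows: train one autoencoder on the raw input, take its code, then train a second autoencoder on that code. The helper `train_autoencoder` and all sizes here are hypothetical; it uses a plain linear autoencoder for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)

def train_autoencoder(X, k, lr=0.05, steps=3000):
    """Train a linear autoencoder with code size k; return encoder weights."""
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, k)); b_enc = np.zeros(k)
    W_dec = rng.normal(scale=0.1, size=(k, d)); b_dec = np.zeros(d)
    for _ in range(steps):
        code = X @ W_enc + b_enc
        recon = code @ W_dec + b_dec
        g = 2 * (recon - X) / n               # reconstruction-error gradient
        g_code = g @ W_dec.T
        W_dec -= lr * (code.T @ g); b_dec -= lr * g.sum(axis=0)
        W_enc -= lr * (X.T @ g_code); b_enc -= lr * g_code.sum(axis=0)
    return W_enc, b_enc

# Hypothetical 8-D input data.
X = rng.normal(size=(300, 4)) @ rng.normal(scale=0.5, size=(4, 8))

# Layer 1: encode the raw input into code1.
W1, b1 = train_autoencoder(X, k=4)
code1 = X @ W1 + b1
# Layer 2: the first layer's code becomes the second layer's input.
W2, b2 = train_autoencoder(code1, k=2)
code2 = code1 @ W2 + b2
```

Each call trains one layer in isolation against its own reconstruction error; stacking the resulting encoders gives the multi-layer representation the text describes.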
3) Supervised fine-tuning:
Using the methods described above, we can obtain many layers. As for the required number of layers (or the necessary depth), there's currently no scientific method for evaluation; it requires experimentation and adjustment. Each layer will provide a different representation of the original input. Of course, we believe the more abstract the better, much like the human visual system.
At this point, the autoencoder cannot be used to classify data because it hasn't learned how to connect an input to a class. It has only learned how to reconstruct or reproduce its input. In other words, it has only learned to acquire a feature that can well represent the input, and this feature can represent the original input signal to the greatest extent. Therefore, to achieve classification, we can add a classifier (such as logistic regression, SVM, etc.) to the top encoding layer of the autoencoder, and then train it using the standard supervised training method for multi-layer neural networks (gradient descent).
In other words, at this point we feed the feature code from the last layer into a final classifier and fine-tune it through supervised learning on labeled samples. This can be done in two ways. One is to adjust only the classifier (the black part of the diagram):
The other is to fine-tune the entire system using labeled samples (this is ideal when there is enough data; i.e., end-to-end learning).
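The first option, training only the classifier on top of frozen features, can be sketched as a logistic regression on the top-layer code. The feature matrix here is faked with a label-dependent shift purely for illustration; in practice it would be the code produced by the trained encoder stack.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-in for the top-layer code of a trained encoder:
# random features shifted according to the (binary) class label.
n, k = 200, 3
labels = rng.integers(0, 2, size=n)
code = rng.normal(size=(n, k)) + 2.0 * labels[:, None]

# Train only the classifier; the encoder weights stay frozen.
w = np.zeros(k); b = 0.0
lr = 0.5
for _ in range(300):
    p = 1 / (1 + np.exp(-(code @ w + b)))   # logistic regression on the code
    g = (p - labels) / n                    # cross-entropy gradient
    w -= lr * (code.T @ g)
    b -= lr * g.sum()

acc = ((code @ w + b > 0) == (labels == 1)).mean()
```

The second option would backpropagate the same classification gradient further, through `code` into the encoder weights themselves, updating the whole stack end to end.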
Once the final supervised training is complete, the system can be used for classification. The top layer of the neural network acts as a linear classifier, and we can replace it with a better-performing classifier (such as an SVM) if desired.