
Artificial Intelligence Unveiled: What are the Mainstream Machine Vision Technologies?

2026-04-06 03:32:35 · #1

Currently, the goal of machine vision is to build a machine vision system that can handle specific tasks in a controlled environment. Because the vision environment in industry is controllable and the tasks being handled are specific, most machine vision applications are now used in industry.

Human visual perception involves capturing light sources through the cone and rod cells of the retina, which are then transmitted to the visual cortex of the brain via nerve fibers, forming the images we see. Machine vision, however, operates differently. The input to a machine vision system is an image, and the output is a perceptual description of that image. This description is closely related to the objects or scenes in the images, and it helps the machine perform specific subsequent tasks, guiding the robotic system to interact with its surroundings.

So, what are the mainstream machine vision technologies to date?

01 The Backbone – Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are currently the most widely used model architecture in computer vision. Using CNNs for feature extraction makes it possible to capture patterns between adjacent pixels while keeping the number of parameters independent of the image size. A typical CNN structure applies multiple convolutional and pooling layers to the input image, usually followed by a series of fully connected layers. The ReLU activation function is typically applied to the outputs of the convolutional or fully connected layers, and Dropout is commonly used to prevent overfitting.
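As a toy illustration of these building blocks, here is a minimal NumPy sketch (not a real training setup) of a single convolution, ReLU, and max-pooling pass. Note how the same 9-weight kernel handles inputs of any size, which is why the parameter count does not depend on image dimensions:

```python
import numpy as np

def conv2d(x, kernel):
    """Valid 2-D convolution (cross-correlation) of a single-channel image."""
    kh, kw = kernel.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0)

def max_pool(x, size=2):
    h, w = x.shape
    h, w = h - h % size, w - w % size          # trim to a multiple of the window
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])             # 3x3 vertical-edge filter: 9 parameters

small = max_pool(relu(conv2d(np.random.rand(8, 8), kernel)))
large = max_pool(relu(conv2d(np.random.rand(32, 32), kernel)))
print(small.shape, large.shape)                # same 9 weights for both input sizes
```

Real frameworks vectorize this over channels and batches, but the sliding-window principle is the same.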

Since AlexNet won the ImageNet competition in 2012, convolutional neural networks have gradually replaced traditional algorithms as the core for processing computer vision tasks.

In recent years, researchers have made significant improvements to the structure of convolutional neural networks by enhancing feature extraction capabilities, improving backpropagation gradient update effects, shortening training time, visualizing internal structures, reducing the number of network parameters, lightweighting models, and automatically designing network structures. As a result, a series of classic models such as AlexNet, ZFNet, VGG, NIN, GoogLeNet and the Inception series, ResNet, WRN and DenseNet, as well as lightweight models such as the MobileNet series, ShuffleNet series, SqueezeNet and Xception have been developed.

Convolutional network diagram

A classic model: AlexNet

AlexNet was the first deep convolutional neural network to win the ImageNet competition, and its main features included:

1. Use ReLU as the activation function.

2. It proposed using Dropout in the fully connected layers to avoid overfitting. (Note: after Batch Normalization (BN) was introduced, BN largely displaced Dropout in later architectures.)

3. Because GPU memory was limited at the time, training was split across two GPUs by grouping the feature maps along the channel dimension.

4. Local Response Normalization (LRN) is used, inspired by lateral inhibition in biological systems, where activated neurons suppress their neighbors. The goal is to make large local response values relatively larger while suppressing convolutional kernels with smaller responses: if a feature produces a large response under one convolutional kernel, its responses under adjacent kernels are suppressed, which reduces the correlation between kernels. Combining LRN with ReLU improved model performance by slightly more than one percentage point.

5. Use overlapping pooling. The authors argue that overlapping pooling enriches the extracted features and makes the model somewhat less prone to overfitting.
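The LRN operation in point 4 can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not AlexNet's original code; the default constants (n=5, k=2, alpha=1e-4, beta=0.75) follow the values commonly cited for AlexNet:

```python
import numpy as np

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Cross-channel LRN: each activation is divided by a term that grows
    with the squared activations of n neighbouring feature maps.
    `a` has shape (channels, height, width)."""
    c = a.shape[0]
    out = np.empty_like(a)
    for i in range(c):
        lo, hi = max(0, i - n // 2), min(c, i + n // 2 + 1)
        denom = (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        out[i] = a[i] / denom
    return out

acts = np.abs(np.random.randn(8, 4, 4))   # 8 toy feature maps, post-ReLU (non-negative)
normed = local_response_norm(acts)
```

Because the denominator exceeds 1 for these constants, every response shrinks, with the largest relative suppression landing on channels whose neighbors respond strongly.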

ResNet: a culmination of ideas

Generally speaking, the deeper and wider a network, the stronger its feature extraction ability. In practice, however, once the network exceeds a certain depth, accuracy starts to degrade and convergence slows down.

Traditional convolutional networks have only one connection per layer during a forward pass. ResNet adds residual connections, thereby increasing the flow of information from one layer to the next. FractalNets repeatedly combine several parallel layer sequences with varying numbers of convolutional blocks, increasing nominal depth while maintaining a short forward propagation path. Similar operations include Stochastic depth and Highway Networks. These models all exhibit a common feature: shortening the path between preceding and subsequent layers, with the primary goal of increasing the flow of information between different layers.
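The residual idea itself is tiny. A minimal sketch with toy fully-connected layers (invented weights and dimensions, purely illustrative): the block computes a residual F(x) and adds the identity shortcut, so if the layers contribute nothing the block still passes x through unchanged:

```python
import numpy as np

def layer(x, w):
    """A toy fully-connected layer with ReLU, standing in for part of F(x)."""
    return np.maximum(w @ x, 0)

def plain_block(x, w1, w2):
    return layer(layer(x, w1), w2)

def residual_block(x, w1, w2):
    # ResNet's key idea: learn a residual F(x) and add the identity shortcut,
    # so information (and gradients) can flow straight through the addition.
    return x + plain_block(x, w1, w2)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
w1, w2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
y = residual_block(x, w1, w2)
```

With zero weights, `plain_block` outputs zeros and the residual block reduces to the identity, which is exactly why very deep residual stacks do not get worse as layers are added.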

02 Rising Stars – Transformers

The Transformer is a self-attention model architecture that has achieved great success in NLP since 2017, especially in sequence-to-sequence (seq2seq) tasks such as machine translation and text generation. In 2020, Google proposed ViT (Vision Transformer), a pure-Transformer architecture that achieved performance comparable to CNNs on the ImageNet classification task. Since then, many Transformer architectures derived from ViT have also achieved success on ImageNet.
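At the core of every Transformer is scaled dot-product self-attention. Here is a minimal single-head sketch in NumPy (toy dimensions and random weights, chosen only for illustration):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over a sequence x of shape (n, d)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])    # pairwise token similarities
    weights = softmax(scores, axis=-1)         # each row is a distribution over tokens
    return weights @ v                         # every output mixes all inputs

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 8))                # 6 tokens (e.g. image patches), dim 8
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
```

Because each output token attends to every input token, attention is a global operation, in contrast to a convolution's fixed local receptive field.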

Compared to CNNs, Transformers have the advantage of weaker built-in inductive biases and prior assumptions, so they can be regarded as general-purpose computational primitives for different learning tasks, with parameter efficiency and performance comparable to CNNs. The corresponding drawback is a stronger dependence on large datasets during pre-training, precisely because Transformers lack the explicitly defined inductive priors that CNNs have. This has motivated a new trend: combining self-attention with CNNs to establish strong baselines (e.g., BoTNet).

Vision Transformer (ViT) applies a pure Transformer directly to a sequence of image patches for classification, with excellent results. It outperforms state-of-the-art convolutional networks on many image classification benchmarks while requiring significantly fewer computational resources for pre-training.
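The patch step is simple to sketch. This NumPy snippet splits an image into non-overlapping patches and flattens each one into a vector, which is what ViT feeds into its linear projection (the 224-pixel image and 16-pixel patch size mirror the commonly cited ViT configuration; the random image is a stand-in):

```python
import numpy as np

def image_to_patches(img, p):
    """Split an (h, w, c) image into non-overlapping p x p patches and
    flatten each into a vector, as ViT does before the linear projection."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0
    patches = img.reshape(h // p, p, w // p, p, c).swapaxes(1, 2)
    return patches.reshape(-1, p * p * c)      # (num_patches, patch_dim)

img = np.random.rand(224, 224, 3)
tokens = image_to_patches(img, 16)
print(tokens.shape)                            # 14*14 patches, each 16*16*3 values
```

Each row then plays the role of a "word" in the Transformer's input sequence, with a learned position embedding added to preserve spatial order.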

DETR is the first object detection framework to successfully use the Transformer as a major building block in its pipeline. It matches the performance of previous state-of-the-art methods (highly optimized Faster R-CNN) with a simpler and more flexible pipeline.

Variant models of the Transformer are currently a research hotspot and can be broadly divided into the following types: 1) lightweight models; 2) enhanced cross-module connections; 3) adaptive computation time; 4) divide-and-conquer strategies; 5) recurrent Transformers; 6) hierarchical Transformers.

03 Deceiving the Machine's Eyes – Adversarial Examples

One issue that has recently drawn attention in the research community is the sensitivity of these systems to adversarial examples. An adversarial example is a subtly perturbed image designed to trick the system into making incorrect predictions. Before these systems can be deployed in the real world, they must be able to withstand such examples. To this end, recent work has explored making these systems more robust against adversarial attacks by including adversarial examples during training.
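A classic way to craft such a perturbation is the Fast Gradient Sign Method (FGSM), which nudges the input by eps in the sign of the loss gradient. As an illustration (not any production attack), here is FGSM on a toy logistic-regression model, where the input gradient has a closed form; the model, weights, and eps are invented for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, eps):
    """FGSM on a logistic-regression model p = sigmoid(w . x), y in {-1, +1}.
    The loss is -log p(y|x); its gradient w.r.t. x is -y * sigmoid(-y w.x) * w."""
    margin = y * (w @ x)
    grad_x = -y * sigmoid(-margin) * w         # closed-form input gradient
    return x + eps * np.sign(grad_x)           # one step that maximally hurts the loss

rng = np.random.default_rng(2)
w = rng.standard_normal(10)
x = w.copy()                                   # a point the model classifies confidently as +1
x_adv = fgsm(x, y=1, w=w, eps=0.5)
print(sigmoid(w @ x), sigmoid(w @ x_adv))      # confidence drops after the perturbation
```

For deep networks the gradient comes from backpropagation instead of a formula, but the attack step is identical, which is why adversarial training simply mixes such perturbed inputs into the training batches.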

Currently, model attacks are mainly classified into two categories: attacks on the training phase and attacks on the inference phase.

Attacks during the training phase (Training in Adversarial Settings) primarily involve making subtle perturbations to the training data so that the trained model's behavior deviates from expectations. For example, an attacker can replace labels in the training data so that samples and labels are mismatched, ensuring the final trained model differs from what was intended. Alternatively, attackers with online access to the training data can inject malicious data to perturb the online training process, again producing outputs that deviate from expectations.

Inference-stage attacks (Inference in Adversarial Settings) occur after a model has been trained. We can view the model as a box: if the box is transparent to us, it is a "white-box" model; otherwise, it is a "black-box" model. A white-box attack requires knowing all of the model's parameters, an assumption that is rarely realistic in practice but still possible, hence worth studying. Black-box attacks are closer to real-world scenarios: inferring the model's internal structure from its inputs and outputs; attacking the model with crafted perturbations; building shadow models for membership inference attacks; extracting sensitive training data; inverting the model to recover its parameters; and so on.

Defense mechanisms against adversarial attacks mainly rely on introducing auxiliary blocks (AuxBlocks) that carry extra information and produce additional outputs, forming a self-ensembling defense. This mechanism is effective against both black-box and white-box attacks. In addition, defensive distillation can provide some protection: knowledge from a trained model is transferred to another (often simpler) network by training it on the first model's softened output probabilities, which helps blunt adversarial attacks.
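The "softened outputs" used in distillation come from a softmax evaluated at a raised temperature T. A small sketch with illustrative logits (the specific numbers are invented for the example):

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Softmax at temperature T. High T spreads probability mass across
    classes, producing the soft labels a distilled student is trained on."""
    z = logits / T
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([8.0, 2.0, 1.0])
hard = softmax_T(logits, T=1.0)      # nearly one-hot
soft = softmax_T(logits, T=20.0)     # much flatter distribution, same argmax
```

The flatter targets preserve the ranking between classes while shrinking the confidence gap, which is the signal the student network learns from in defensive distillation.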

Typical application areas for adversarial learning include: 1) autonomous driving; 2) financial fraud detection.

Autonomous driving is the future direction of intelligent transportation, but people are hesitant to trust this complex technology until its safety is fully verified. While many automakers and technology companies have run numerous experiments in this field, adversarial examples remain a significant challenge for autonomous driving. Several attack demonstrations illustrate this:

1. Adversarial perturbations can make pedestrians "invisible" to the model, causing it to ignore road obstacles.

2. Specially generated adversarial images caused Tesla's Autopilot system to output incorrect recognition results, activating the vehicle's windshield wipers.

3. A few adversarial-example stickers placed at specific locations on the road caused a car in Autopilot mode to swerve into the oncoming lane.

4. Researchers were even able to steer the vehicle with a game controller through the Autopilot system.

04 Self-Learning Can Also Lead to Success – Self-Supervised Learning

Deep learning requires clean, labeled data, which is difficult to obtain for many applications. Annotating large amounts of data takes significant human effort and is time-consuming and expensive. Furthermore, the distribution of real-world data is constantly changing, so models must be continuously retrained on fresh data. Self-supervised methods address some of these challenges by training the model on large amounts of raw, unlabeled data. Here, supervision is provided by the data itself rather than by human annotations, and the goal is to solve a pretext task. Pretext tasks are usually heuristics (e.g., rotation prediction) in which both the input and the target come from the unlabeled data. The purpose of a pretext task is to make the model learn relevant features that can later be reused for downstream tasks (for which some annotations are often available).
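The rotation-prediction heuristic mentioned above can be sketched directly: each image is rotated by a random multiple of 90 degrees and the rotation index becomes the label, so no human annotation is needed (the random "images" below are stand-ins for a real dataset):

```python
import numpy as np

def make_rotation_task(images, rng):
    """Build a rotation-prediction pretext dataset: each image is rotated by
    k * 90 degrees and the model must predict k. Labels come from the data
    itself, with no human annotation."""
    xs, ys = [], []
    for img in images:
        k = rng.integers(0, 4)                 # 0, 90, 180 or 270 degrees
        xs.append(np.rot90(img, k))
        ys.append(k)
    return np.stack(xs), np.array(ys)

rng = np.random.default_rng(3)
images = np.random.rand(16, 28, 28)
x, y = make_rotation_task(images, rng)
```

A classifier trained to predict `y` from `x` must learn orientation-sensitive features (edges, object parts), which is exactly the representation later reused for downstream tasks.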

Self-supervised learning is a data-efficient learning paradigm. Supervised methods teach models to excel at specific tasks, whereas self-supervised learning learns general representations that are not designed for one particular task but instead capture richer statistics useful across a variety of downstream tasks. Among self-supervised methods, contrastive learning further improves the quality of the extracted features. The data-efficient nature of self-supervised learning makes it attractive for transfer learning applications.

The current field of self-supervised learning can be broadly divided into two branches. One is self-supervised learning for solving specific tasks, such as scene de-occlusion discussed previously, as well as self-supervised depth estimation, optical flow estimation, and image-to-image point matching. The other branch is used for representation learning. A typical example of supervised representation learning is ImageNet classification. In unsupervised representation learning, the most prevalent method is self-supervised learning.

Self-supervised learning methods rely on the spatial and semantic structure of the data. For images, spatial structure is extremely important, hence the wide application of self-supervision in computer vision. One approach uses transformations such as rotation, jigsaw-style stitching, and colorization as pretext tasks for learning representations from images. In colorization, a grayscale photo is taken as input and a color version of it is generated. Another widely used self-supervised method in computer vision is image patching; an example is the work of Doersch et al., in which random pairs of image patches were extracted from a large unlabeled dataset and a convolutional neural network was trained to predict the position of the second patch relative to the first. Other methods are also used for self-supervised learning, including image inpainting.
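The relative patch-position task of Doersch et al. can be sketched as follows. The patch size, grid layout, and sampling details here are simplified assumptions for illustration (the original work also adds gaps and jitter between patches to prevent trivial shortcuts):

```python
import numpy as np

def sample_patch_pair(img, patch, rng):
    """Sample an anchor patch and one of its 8 neighbours; the pretext label
    is the neighbour's position (0-7), in the spirit of Doersch et al."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    h, w = img.shape[:2]
    # the anchor must leave room for all 8 neighbours around it
    i = rng.integers(patch, h - 2 * patch + 1)
    j = rng.integers(patch, w - 2 * patch + 1)
    label = rng.integers(0, 8)
    di, dj = offsets[label]
    a = img[i:i + patch, j:j + patch]
    b = img[i + di * patch:i + (di + 1) * patch,
            j + dj * patch:j + (dj + 1) * patch]
    return a, b, label

rng = np.random.default_rng(4)
img = np.random.rand(96, 96)
a, b, label = sample_patch_pair(img, 16, rng)
```

A network that predicts `label` from the pair `(a, b)` is forced to reason about spatial layout, learning features that transfer to detection and classification.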

Conclusion:

Since the advent of AlexNet in 2012, the field of machine vision has seen rapid advancements. Machine vision has gradually approached and even surpassed human vision in many areas. With continued technological progress, machine vision technology will undoubtedly become even more powerful, bringing us even more surprises in fields such as security, autonomous driving, defect detection, and object recognition.

