
Deep learning methods for complex visual big data

2026-04-06 07:21:04

With the rapid development of electronic information technology and the widespread application of various cameras, global image and video data is exploding, and human society is entering the era of big data in visual information. While the massive amount of images and videos facilitates people's production and life, it also poses new challenges to intelligent vision technology.

Most current vision processing systems can effectively acquire, transmit, and store images and videos, but they lack efficient and accurate methods for analyzing, recognizing, and mining their content. First, image and video content is complex, encompassing diverse scenes and numerous object types, requiring processing methods to be robust against such a wide variety of objects. Second, under uncontrolled conditions, image and video content varies greatly due to factors such as lighting, pose, and occlusion, demanding robustness to complex changes in processing methods. Finally, image and video data is massive in volume and has high feature dimensions; some applications require real-time processing, placing high demands on computational efficiency for massive datasets. The rapid development of deep learning methods in recent years has provided an effective approach to solving these problems.

Figure 1. Characteristics, challenges, and core issues of visual big data

The Past and Present of Deep Learning Methods

Deep learning, as an extension of traditional neural networks, has made significant progress in recent years on semantic cognition problems in speech, image, and natural language processing, providing a general framework for the representation and understanding of visual big data. It uses deep neural networks with multiple hidden layers to solve artificial intelligence tasks that require highly abstract features, drawing inspiration from the multi-layered (typically 8-10 stages) neural processing structure of the human brain. This multi-layered nonlinear structure endows deep neural networks with the ability to extract abstract semantic features and to model complex tasks. Traditional neural networks, by contrast, were limited by overfitting, making it difficult to train multi-layered network models with strong generalization capabilities.

Deep learning discovers distributed feature representations of data by combining low-level features to form more abstract high-level representations of attribute categories. One motivation for developing deep learning is to simulate the analytical processing mechanism of the human brain to interpret data. The human cerebral cortex has a multi-layered structure, where information is processed and abstracted layer by layer. Deep architecture can be viewed as a form of "factorization," extracting reusable features that express essential characteristics from complex data. Due to their multi-layered non-linear structure, deep learning models possess powerful capabilities, making them particularly suitable for learning from large datasets (Figure 2). This is because traditional shallow models, due to their limited capabilities, often saturate when the amount of training data increases to a certain extent, failing to fully utilize the effective information contained in large-scale training data. In contrast, deep learning methods, due to their powerful capabilities, can more fully utilize large-scale data to extract effective features.

Figure 2. Performance comparison of deep learning methods and non-deep learning methods as the amount of training data increases.

Advances of Deep Learning Methods in the Field of Vision

Currently, deep learning has made groundbreaking progress in multiple application areas of artificial intelligence, such as image classification, speech recognition, and natural language understanding. Due to its superior performance, deep learning has also attracted widespread interest from industry, with internet companies such as Google, Facebook, Microsoft, and Baidu becoming important forces in deep learning technology innovation. In the speech domain, deep learning has replaced the Gaussian Mixture Model (GMM) in acoustic models with the Deep Belief Network (DBN), achieving a significant relative error-rate reduction (around 30%), and has been successfully applied in speech recognition engines from Microsoft, Google, and iFlytek. In machine translation, neural language models have achieved better results than traditional methods. In 2016, the AlphaGo program developed by Google DeepMind, relying on the power of deep learning and reinforcement learning, defeated top South Korean Go player Lee Sedol 4-1 in their human-machine Go match.

Object classification

In the field of image processing, Krizhevsky et al. used a multi-layer convolutional neural network to achieve significantly better results than traditional methods in the large-scale image classification competition ImageNet LSVRC-2012 (1,000 categories, 1.2 million images), drastically reducing the Top-5 error rate from 26% to 15%. This network, later known as AlexNet, has 8 learned layers (5 convolutional and 3 fully connected), containing approximately 650,000 neurons and 60 million parameters. Convolutional neural networks have since become the mainstream method in this field. Building on this work, researchers have proposed deeper networks such as VGGNet, GoogLeNet, and ResNet, further improving the performance of deep learning methods in large-scale image classification. Deep networks can also accurately detect the location of objects in images and predict the position and pose of body parts such as the hands, head, and feet.

Figure 3. AlexNet network structure
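As a rough sanity check on those figures, the parameter count of AlexNet can be tallied layer by layer from the published architecture (the grouped convolutions reflect the original two-GPU split). The helper names below are illustrative, not any framework's API:

```python
# Parameter tally for the AlexNet architecture (Krizhevsky et al., 2012).
# conv2/4/5 use grouped convolutions from the original two-GPU design,
# so each filter sees only half of the input channels.

def conv_params(kh, kw, in_per_group, out_ch):
    # weights per filter * number of filters, plus one bias per filter
    return (kh * kw * in_per_group) * out_ch + out_ch

def fc_params(n_in, n_out):
    # dense weight matrix plus one bias per output unit
    return n_in * n_out + n_out

layers = [
    conv_params(11, 11, 3, 96),     # conv1
    conv_params(5, 5, 48, 256),     # conv2 (2 groups: 96 -> 48 per group)
    conv_params(3, 3, 256, 384),    # conv3
    conv_params(3, 3, 192, 384),    # conv4 (2 groups)
    conv_params(3, 3, 192, 256),    # conv5 (2 groups)
    fc_params(6 * 6 * 256, 4096),   # fc6
    fc_params(4096, 4096),          # fc7
    fc_params(4096, 1000),          # fc8
]
total = sum(layers)
print(f"total parameters: {total:,}")  # roughly 61 million
```

The fully connected layers dominate the count, which is one reason later architectures replaced them with pooling or convolutional heads.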

Human Image Analysis

In face recognition, deep neural networks have surpassed human-level accuracy on the challenging LFW database, a widely used benchmark in the field. Figure 4 shows the DeepID network structure, which has achieved excellent face recognition performance. Exploiting the regular structure of face images, this network uses locally shared convolutions, improving its ability to classify faces. The authors further propose a latent factor convolutional neural network for cross-age face recognition. This network introduces latent factor learning into deep networks, decomposing the features of the fully connected layers into two parts, identity and age (Figure 5), providing a new approach to improving the robustness of deep networks to age variation. Experiments show that the network achieves 99% accuracy on the well-known LFW database, surpassing the human eye's 97% performance on this database, and achieves leading recognition rates of 88.1% and 98.5% on the important cross-age databases MORPH and CACD, respectively. Furthermore, a center loss function for deep networks is proposed for the first time to enhance the clustering of deeply learned features. Experiments show that this method improves the performance of deep face recognition networks and achieves good results on the FGNet task of the million-scale MegaFace benchmark.

Figure 4. DeepID face classification network structure [9]

Figure 5. Latent factor convolutional neural network for cross-age face recognition

Scene recognition

Scene recognition and understanding is a fundamental problem in computer vision. Traditional scene recognition methods rely largely on local features such as SIFT, HOG, and SURF. In recent years, convolutional neural networks have also been applied to scene classification. Early work found that networks pre-trained on the large-scale object database ImageNet and then fine-tuned also performed well in scene classification. Compared to object classification, however, scene categories are more abstract, and images within the same scene category may vary greatly in content and layout. MIT released the large-scale Places scene database, promoting the application of deep neural networks in large-scale scene classification and allowing researchers to train deep scene-classification models directly on scene data without relying on ImageNet. Many network architectures that perform well in object classification, such as AlexNet, VGGNet, GoogLeNet, and ResNet, have also achieved good results in scene classification. Research shows that strategies such as dropout and multi-scale data augmentation help train deep networks and alleviate overfitting, while methods such as relay backpropagation can further improve the scene-classification performance of deep networks. Compared with traditional hand-designed features, the scene features learned by deep neural networks are more expressive and carry stronger semantic meaning, and thus achieve better recognition results.

Figure 6. Knowledge-guided convolutional neural network
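Of the training strategies just mentioned, dropout is the simplest to sketch: during training each unit is zeroed with probability p, and the survivors are rescaled so that expected activations match test time (the "inverted dropout" convention). A minimal plain-Python illustration, with an optional `rng` argument added here for reproducibility:

```python
import random

def dropout(activations, p=0.5, train=True, rng=None):
    """Inverted dropout: zero each unit with probability p during training,
    scaling survivors by 1/(1-p) so expected activations are unchanged."""
    if not train or p == 0.0:
        return list(activations)          # identity at test time
    rng = rng or random.Random()
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

With p=0.5, each surviving unit is doubled, so every output is either 0 or twice its input; at test time the layer is a no-op. The random co-adaptation this breaks is what alleviates overfitting.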

Action recognition

Action recognition is a crucial problem in computer vision. In recent years, researchers have introduced deep neural networks into video analysis and understanding, making this a new research direction in action recognition. Karpathy et al. proposed a convolutional neural network (CNN) that uses different temporal fusion strategies for action recognition in videos. However, despite pre-training on massive data (Sports-1M), the model's action recognition accuracy still leaves room for improvement.

Another popular approach is the 3D CNN, which extends the standard 2D CNN along the time axis to model videos spatiotemporally. However, its enormous training complexity requires either massive amounts of data or decomposition of the 3D convolutional kernels. To avoid the training difficulties of 3D CNNs, researchers at Oxford proposed the two-stream CNN framework. By designing independent appearance-stream and motion-stream CNNs, this framework achieves accurate action recognition on the standard UCF101 and HMDB51 databases. However, the input to the motion-stream CNN is stacked optical flow, so the framework captures only short-term motion and ignores long-term motion information in the video.

To further improve the recognition accuracy of this structure, Trajectory-pooled Deep Descriptors (TDDs) were proposed, providing a new mechanism for fusing deep models with traditional trajectory features. Experiments show that TDD has stronger representational and discriminative power than both traditional hand-designed features and earlier deep models, and significantly improves video classification accuracy. Deep models for mining key video volumes and temporal segment networks (TSN) have also been developed to strengthen the spatiotemporal modeling of such frameworks. Additionally, the Enhanced Motion Vector CNN (EMV-CNN) replaces computationally expensive optical flow with motion vectors, achieving speedups of over 20 times.

The success of recurrent neural networks (RNNs), particularly Long Short-Term Memory (LSTM) models, in various sequence modeling tasks has gradually steered deep learning-based action recognition toward sequence modeling. A common practice is to feed features extracted by two-stream CNNs into an LSTM to train the sequence model.

Figure 7. Deep convolutional video features of trajectory sampling
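The two-stream framework described above comes down to late fusion: each stream independently produces class scores, which are combined (typically by a weighted average of softmax probabilities) into the final prediction. A minimal sketch under that assumption; the function names and the default equal weighting are illustrative:

```python
import math

def softmax(scores):
    """Convert raw class scores into a probability distribution."""
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def fuse_two_streams(appearance_logits, motion_logits, w_app=0.5):
    """Late fusion: weighted average of the two streams' class probabilities."""
    p_app = softmax(appearance_logits)
    p_mot = softmax(motion_logits)
    return [w_app * a + (1.0 - w_app) * m for a, m in zip(p_app, p_mot)]
```

Because the result is a convex combination of two distributions, it is itself a valid distribution; in practice the motion stream is often given a larger weight when optical flow is informative.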

In addition, deep learning has achieved better results than traditional methods in many tasks such as image restoration and super-resolution, image quality assessment, semantic segmentation and parsing, image content text generation, and medical image analysis, which has greatly promoted the development of related technologies and methods.

Development trend

While deep learning methods have made significant progress, they still face considerable challenges in many applications of computer vision, primarily in the following aspects:

First, current deep learning methods often rely on large-scale data for training. However, not all visual problems have sufficient training samples. For example, retrieving specific people or objects, identifying rare species, and handling rare cases in medical images may leave training data very scarce, or make collecting large numbers of samples prohibitively expensive. In contrast, the human visual system needs only a small number of samples to recognize a category, largely because humans can reuse knowledge and experience learned in other domains. In recent years, learning from small datasets has increasingly attracted the attention of researchers, but effective deep learning with small datasets remains a challenging open problem.

Secondly, deep convolutional networks employ backpropagation for parameter learning, which requires training data with explicit and abundant supervision information. However, in many practical problems, detailed and accurate image labeling is extremely time-consuming (such as pixel-level labeling in scene parsing and fine spatiotemporal labeling in videos); furthermore, much training data lacks supervision information altogether or contains noise (such as photos collected from the internet). Therefore, training deep networks with weak, noisy, or missing supervision is of significant practical importance for exploiting large amounts of incompletely labeled data.

Finally, deep neural networks are often massive in scale and have numerous parameters. For example, even with the input image scaled down to 224×224, the AlexNet model still contains 60 million parameters. This makes deep neural networks difficult to apply in environments with limited computing and storage resources, such as mobile devices and embedded systems. It also makes it difficult to directly use high-resolution images as input to deep networks. Therefore, compressing and accelerating complex deep network models to reduce computational and storage consumption is of practical significance in addressing the resource-constrained challenges of deep learning methods.
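One common family of compression techniques consistent with this goal is magnitude-based pruning: weights with the smallest absolute values are zeroed, after which the sparse model can be stored and executed more cheaply (usually followed by fine-tuning to recover accuracy). A minimal sketch over a flat weight list; this is an illustrative simplification, not a specific published pipeline:

```python
def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out roughly the `sparsity` fraction of weights with the
    smallest absolute values (ties at the threshold are also pruned)."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    # k-th smallest absolute value becomes the pruning threshold
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

For example, pruning `[0.1, -0.05, 0.9, -0.8]` at 50% sparsity removes the two smallest-magnitude weights while keeping the two dominant ones, shrinking storage when the result is kept in a sparse format.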
