In the previous article, we introduced the field of machine vision and discussed a highly effective algorithm—the pixel-based intelligent classification decision tree—which has been widely used in medical image processing and Kinect. In this article, we will look at the recently popular deep neural networks (deep learning) and their successful applications in machine vision, and then examine the future development of machine vision and machine learning.
Deep Neural Networks
In recent years, the quality and quantity of training datasets available for machine vision research have improved significantly. These improvements rely largely on the growth of crowdsourcing, which has pushed the number of labeled image samples into the millions. A prominent example is ImageNet, a dataset containing millions of labeled images spanning tens of thousands of categories.
After several years of comparatively slow progress on the ImageNet benchmark, Krizhevsky et al. ignited the field in 2012. They demonstrated that general-purpose GPU computing, combined with a few seemingly minor changes to the algorithm, made it possible to train convolutional neural networks with far more layers than ever before. Their accuracy on the 1000-category ImageNet classification task was a landmark result. The work attracted significant media attention and even prompted a number of startup acquisitions. Deep learning has since become a hot topic in machine vision, and many recent papers have extended these methods to object localization, face recognition, and human pose estimation.
Future Outlook
There's no doubt that deep convolutional neural networks are powerful; however, can they completely solve machine vision problems? We can be certain that deep learning will continue to be popular in the coming years and will drive the development of related technologies, but we believe there's still a long way to go. While we can only speculate on what will change in the future, some trends are already evident.
Representation: Currently, these neural networks can only recognize relatively simple image content; they cannot yet reach a deeper understanding of the relationships between the objects in an image, or of the role a particular object plays in our lives (for example, a network cannot infer that the people in an image probably have wet hair from the fact that their hair is shiny and they are all holding hair dryers). New datasets, such as Microsoft's COCO, should further improve this situation by providing more detailed labels for the individual objects in "atypical" images, such as images containing multiple objects that are not in the most prominent positions.
Efficiency: Although parallel processing lets deep neural networks run reasonably fast on images, they suffer from a problem that the decision trees in our previous article avoid: every test sample must traverse every node of the network before an output is produced. Furthermore, even with the fastest GPU clusters for acceleration, training a neural network can take days or weeks, which limits how quickly we can experiment.
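This cost asymmetry can be made concrete with a toy comparison. The following is a minimal sketch, not code from the article: the layer sizes, the split rule inside `tree_predict`, and the tree depth of 20 are all arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fully connected network: EVERY weight participates in each prediction.
layer_sizes = [784, 512, 512, 10]
weights = [rng.standard_normal((m, n)) for m, n in zip(layer_sizes, layer_sizes[1:])]

def mlp_predict(x):
    """Forward pass: every one of the ~670k parameters is touched per sample."""
    for w in weights:
        x = np.maximum(x @ w, 0.0)  # ReLU
    return x

def tree_predict(x, depth=20):
    """Decision tree: one feature test per level, a single root-to-leaf path."""
    node = 0
    for _ in range(depth):
        feature = node % x.size                     # hypothetical split rule
        node = 2 * node + (1 if x[feature] > 0 else 2)
    return node                                     # leaf index

x = rng.standard_normal(784)
mlp_ops = sum(w.size for w in weights)  # multiply-adds for one network prediction
tree_ops = 20                           # feature tests for one tree prediction
print(mlp_ops, tree_ops)
```

The point of the sketch is the per-sample cost: the network's work grows with its total parameter count, while the tree's work grows only with its depth.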
Structure learning: Today's deep convolutional neural networks (CNNs) use a carefully hand-designed and fairly rigid structure that has been refined over many years of research. Changing it usually means adjusting only the size and number of layers (i.e., the depth of the network), even though these choices significantly affect prediction accuracy. Rather than simply tuning the parameters of a fixed architecture, we would like to learn more flexible network structures directly from the data.
Recently, we've begun to address the problems mentioned above, especially the latter two. We are particularly excited about our recent work on decision jungles: ensembles of rooted directed acyclic graphs (DAGs). You can think of a rooted DAG as a decision tree in which each child node is allowed to have multiple parent nodes. We have shown that, compared with decision trees, this reduces memory consumption by an order of magnitude while also improving the algorithm's generalization ability. Although DAGs and neural networks are similar in many ways, they differ in two important respects: first, the structure of a DAG can be learned jointly with the model's parameters; second, DAGs retain the efficient computation of decision trees, since each test sample follows only a single path through the DAG rather than traversing every node as in a neural network. We are actively investigating whether combining decision jungles with other forms of deep learning can yield even more efficient deep neural networks.
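To make the tree-versus-DAG distinction concrete, here is a hypothetical miniature "jungle" with a single rooted DAG. All node numbers, feature indices, and thresholds below are invented for illustration; this sketch hard-codes a structure, whereas the published approach learns both structure and parameters from data.

```python
# Hypothetical miniature rooted DAG. Internal nodes are tuples
# (feature_index, threshold, left_child, right_child); leaves are strings.
# Unlike a tree, a child here may be SHARED between several parents.
jungle = {
    0: (0, 0.5, 1, 2),  # root
    1: (1, 0.0, 3, 4),
    2: (1, 1.0, 4, 5),  # node 4 has TWO parents (1 and 2): a DAG, not a tree
    3: "leaf_A",
    4: "leaf_B",        # shared child: merging paths is what saves memory
    5: "leaf_C",
}

def jungle_predict(x):
    """Each test sample follows a single root-to-leaf path through the DAG."""
    node = 0
    while not isinstance(jungle[node], str):
        feature, threshold, left, right = jungle[node]
        node = left if x[feature] <= threshold else right
    return jungle[node]

print(jungle_predict([0.2, -1.0]))  # -> leaf_A
```

Note that evaluation is identical to a decision tree: sharing node 4 changes how much structure must be stored, not how many tests one sample performs.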
If you are interested in applying decision forests to your own problem, you can explore them further with the Gemini model in Azure ML.
In conclusion, the bright future of machine vision is largely due to the development of machine learning. The recent rapid advancements in machine vision are already quite remarkable, but we believe its future remains an exciting open book.