Blob Analysis
In computer vision, a blob is a connected region of an image whose pixels share similar properties such as color or texture. Blob analysis binarizes the image to segment it into foreground and background, then performs connected-component detection to extract the blobs. Simply put, blob analysis finds small regions of abrupt grayscale change within an otherwise smooth area. For example, suppose we have a newly manufactured piece of glass with a very smooth, flat surface. If the glass has no defects, no abrupt grayscale changes can be detected; conversely, if the production process leaves a small bubble, a black spot, or a crack on the glass, texture appears on its surface. After binary thresholding, the resulting spots in the image can be treated as blobs. These regions are defects introduced during production, and detecting them in this way is blob analysis. Blob analysis tools can isolate targets from the background and compute their number, location, shape, orientation, and size, as well as the topological relationships between related blobs. Rather than examining pixels one by one, the process operates on image rows: each row is run-length encoded (RLE) to represent adjacent runs of target pixels, which makes the algorithm significantly faster than pixel-based approaches.
Blob analysis suits target-recognition applications with 2D targets and high-contrast images, such as presence/absence detection and defect detection, including requirements for counting within a numerical range and rotation invariance. It is therefore used in many applications, such as defect detection on textiles, glass, mechanical parts, cola bottles, and pharmaceutical capsules. However, blob analysis is not suitable for the following types of images: 1. low-contrast images; 2. images whose necessary features cannot be described by two gray levels; 3. template-based (graphic) detection requirements. In general, blob analysis detects blobs in an image and fits scenarios with a simple background, foreground defects that need not be categorized, and modest accuracy requirements.
Template matching
Template matching is a basic pattern recognition method: given the pattern of a specific object, it determines where that object appears in an image, which is a matching problem. It is the most fundamental and most commonly used matching method in image processing. In other words, given a small image to be matched, we search for the target within a larger image. If the target exists in the larger image and has the same size, orientation, and image content as the template, we can find it and determine its coordinates by computing statistics such as the mean, gradient, distance, and variance. This means the template must be present in the image essentially exactly: if the image or template changes, for example through rotation, modification of a few pixels, or flipping, matching fails, which is a drawback of this algorithm. The matching procedure therefore slides the template over the image to be detected, from left to right and top to bottom, comparing it with each candidate region.
In OpenCV, template matching is performed with `result = cv2.matchTemplate(image, templ, method)`, where `image` is the image to be detected, `templ` is the template, and `method` is the matching method; the returned `result` is a map of match scores over all template positions. This approach offers better detection accuracy than blob analysis and can also distinguish different defect categories. It is essentially a search algorithm: each ROI of the image to be detected is matched against the images in the template library using the specified method. It requires high consistency in the shape, size, and appearance of defects; a well-developed template library is therefore needed to achieve usable detection accuracy.
Deep learning method
Since AlexNet dramatically improved image-classification accuracy with convolutional neural networks in the ILSVRC competition, researchers have sought to apply deep learning to object detection, and the introduction of R-CNN in 2014 made CNN-based object detection algorithms mainstream, improving both detection accuracy and speed. Convolutional neural networks can not only extract higher-level, more expressive features, but also perform feature extraction, selection, and classification within a single model. Two main families of algorithms exist: two-stage, classification-based object detectors, the R-CNN series, which incorporate a region proposal network (RPN); and one-stage detectors, which cast object detection as a regression problem. Object detection aims to find objects of interest in images or videos and to determine their location and size simultaneously; it is one of the core problems in machine vision.
Object detection involves many uncertainties: the number of objects in an image is unknown, objects differ in appearance, shape, and pose, and lighting and occlusion interfere with image formation, all of which make detection challenging. Since the advent of deep learning, object detection has developed mainly along two directions: two-stage algorithms such as the R-CNN series, and one-stage algorithms such as YOLO and SSD. The main difference is that two-stage algorithms first generate proposals (pre-selected bounding boxes that may contain an object) and then perform fine-grained detection, whereas one-stage algorithms directly predict object class and location from the extracted features. The core of the region-extraction pipeline in two-stage algorithms is the convolutional neural network (CNN): a CNN backbone extracts features, candidate regions are found, and a sliding window determines the target category and location. R-CNN first extracts about 2,000 regions of interest (ROIs) using the Selective Search (SS) algorithm and then extracts features from each ROI. Its drawbacks are that weights cannot be shared between ROIs, causing redundant computation; intermediate data must be stored separately, consuming resources; and forced scaling of the input image degrades detection accuracy.
SPP-Net addresses the fixed input-size limitation by inserting processing between the last convolutional layer and the first fully connected layer, guaranteeing a fixed-size input to the fully connected layers. Because the candidate regions all lie within the whole image, SPP-Net needs only one convolutional pass to obtain features for the entire image and all candidate regions. Fast R-CNN borrows from SPP-Net's feature pyramid and proposes ROI Pooling to map candidate-region feature maps of various sizes into feature vectors of a uniform scale: each candidate region is divided into M×N blocks, and max pooling over each block yields a single value, so every candidate region's feature map is unified into an M×N-dimensional feature vector. However, generating candidate boxes with the Selective Search algorithm remains very time-consuming.
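The ROI Pooling step just described can be sketched in NumPy. This is a simplified single-channel version (the function name and toy inputs are illustrative, not from any library):

```python
import numpy as np

def roi_pool(feature, M, N):
    """Divide a 2-D feature map into an M x N grid and max-pool each cell,
    producing a fixed M x N output regardless of the input size -- the idea
    behind Fast R-CNN's ROI Pooling, simplified to one channel."""
    H, W = feature.shape
    out = np.empty((M, N), dtype=feature.dtype)
    # Cell boundaries: split rows/columns as evenly as possible.
    row_edges = np.linspace(0, H, M + 1).astype(int)
    col_edges = np.linspace(0, W, N + 1).astype(int)
    for i in range(M):
        for j in range(N):
            cell = feature[row_edges[i]:row_edges[i + 1],
                           col_edges[j]:col_edges[j + 1]]
            out[i, j] = cell.max()
    return out

# Candidate regions of different sizes map to the same fixed 2x2 output.
a = roi_pool(np.arange(16).reshape(4, 4), 2, 2)
b = roi_pool(np.arange(36).reshape(6, 6), 2, 2)
print(a.shape, b.shape)  # both (2, 2)
```

Whatever the candidate region's size, the output is always M×N, which is what lets the fully connected layers accept regions of arbitrary scale.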
Faster R-CNN first extracts image features with a CNN backbone; these features are shared by the RPN and the subsequent detector. When the feature maps enter the RPN, nine anchor boxes of different scales and aspect ratios are pre-defined for each feature point. The intersection over union (IoU) and offsets between the anchor boxes and the ground-truth bounding boxes are computed to decide whether a target exists at each location, and the predefined anchors are labelled as foreground or background. The RPN is then trained with its classification and regression loss to refine the ROI positions, and the corrected ROIs are fed into the subsequent network. During detection, however, the RPN must perform a regression and screening pass to separate foreground from background, and the detection network then performs further classification and position regression on the ROIs output by the RPN; these two stages of computation give the model a large number of parameters. Mask R-CNN adds a parallel mask branch to Faster R-CNN, generating a pixel-level binary mask for each ROI. In Faster R-CNN, ROI Pooling produces feature maps of a uniform scale, which causes misalignment when mapping back to the original image, so pixels cannot be aligned exactly. This has relatively little impact on object detection, but for pixel-level segmentation the error cannot be ignored. Mask R-CNN therefore uses bilinear interpolation (ROIAlign) to fix the inaccurate pixel alignment. However, because it inherits the two-stage design, its real-time performance is still not ideal.
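The IoU computation used to label anchors as foreground or background is simple enough to write out directly (the corner-coordinate box convention here is an assumption for the sketch):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as
    (x1, y1, x2, y2). The RPN compares anchors against ground-truth
    boxes with this score to label them foreground or background."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (clamped to zero when the boxes don't overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, about 0.143
```

Anchors whose IoU with some ground-truth box exceeds a high threshold become foreground samples; those below a low threshold become background.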
One-stage algorithms perform feature extraction, target classification, and location regression within a single convolutional network, obtaining target locations and categories in a single forward pass. Their accuracy is slightly lower than that of two-stage detectors, but they are significantly faster. YOLOv1 uniformly scales the input image to 448×448×3 and divides it into a 7×7 grid. Each grid cell predicts the location and confidence of two bounding boxes (b-boxes), which share the same category prediction; in practice one tends to fit larger targets and the other smaller ones. The b-box positions need no manual initialization: they are computed by the model once the weights are initialized, and during training the predicted positions are adjusted as the network weights are updated. However, this algorithm performs poorly on small targets, and each grid cell can predict only one category.
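The output-tensor bookkeeping implied above can be written out explicitly (using the 20-class PASCAL VOC setting of the original YOLO paper):

```python
S, B, C = 7, 2, 20   # grid size, b-boxes per cell, number of classes
# Each cell predicts B boxes (x, y, w, h, confidence) plus ONE shared
# class distribution over C classes -- which is exactly why a cell
# commits to a single category no matter how many boxes it predicts.
per_cell = B * 5 + C
output_shape = (S, S, per_cell)
print(output_shape)  # (7, 7, 30)
```

The single shared class vector per cell is the structural reason for the "one category per grid cell" limitation noted above.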
YOLOv2 divides the original image into a 13×13 grid. Through cluster analysis it assigns 5 anchor boxes to each grid cell, and each anchor box predicts its own class. Target location regression is performed by predicting the offset between the anchor box and the grid cell. SSD keeps the grid-division scheme but extracts features from several convolutional layers of the base network: as depth increases, so does the anchor-box size, which improves SSD's detection accuracy on multi-scale targets.
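The cluster analysis used to pick anchor shapes can be sketched as k-means over box (width, height) pairs with 1 − IoU as the distance, as in YOLOv2's dimension clustering. This is a simplified version; the function name and toy data are illustrative:

```python
import numpy as np

def kmeans_anchors(wh, k, iters=50, seed=0):
    """k-means over (width, height) pairs using 1 - IoU as the distance
    (YOLOv2-style dimension clustering, simplified). IoU is computed as
    if the boxes shared a common top-left corner."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=k, replace=False)].copy()
    for _ in range(iters):
        # IoU between every box and every current center.
        inter = (np.minimum(wh[:, None, 0], centers[None, :, 0]) *
                 np.minimum(wh[:, None, 1], centers[None, :, 1]))
        union = (wh[:, None, 0] * wh[:, None, 1] +
                 centers[None, :, 0] * centers[None, :, 1] - inter)
        assign = np.argmax(inter / union, axis=1)  # nearest = highest IoU
        for j in range(k):
            if np.any(assign == j):
                centers[j] = wh[assign == j].mean(axis=0)
    return centers

# Toy data: two obvious size clusters of boxes.
wh = [[10, 12], [11, 10], [50, 60], [55, 58]]
anchors = kmeans_anchors(wh, k=2)
print(sorted(anchors.tolist()))  # one small and one large anchor shape
```

Using 1 − IoU instead of Euclidean distance keeps large boxes from dominating the clustering, so the resulting anchors match the dataset's box-shape distribution.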
YOLOv3 also uses cluster analysis, presetting three anchor boxes for each grid cell at each scale. It uses only the first 52 layers of Darknet and makes extensive use of residual layers, with strided convolutions replacing pooling for downsampling to avoid pooling's negative effect on gradient flow. YOLOv3 upsamples deep features to the same spatial dimensions as the shallow features to be fused (the channel counts differ) and fuses them by concatenating along the channel axis, producing feature maps at three scales: 13×13×255, 26×26×255, and 52×52×255. The corresponding detection heads are all fully convolutional. YOLOv4, building on the original YOLO architecture, adopts some of the best optimization strategies from recent CNN research, improving data processing, the backbone network, network training, activation functions, and loss functions. Since then, many high-accuracy detection algorithms have been proposed, including recent work applying transformers to vision, which continues to push detection accuracy higher.
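The upsample-and-concatenate fusion just described can be demonstrated with shapes alone (channel counts here are illustrative, not YOLOv3's exact values):

```python
import numpy as np

# Shapes-only sketch of YOLOv3-style feature fusion: a deep 13x13 map is
# upsampled 2x to 26x26, matching the shallow map's spatial size but not
# its channel count, then the two are concatenated along channels.
deep = np.zeros((256, 13, 13))
shallow = np.zeros((512, 26, 26))
up = deep.repeat(2, axis=1).repeat(2, axis=2)  # nearest-neighbour 2x upsample
fused = np.concatenate([up, shallow], axis=0)
print(fused.shape)  # (768, 26, 26)
```

Concatenation (rather than addition) lets the fused map keep both the deep semantic channels and the shallow high-resolution channels for the detection head.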
In summary, the choice of representation has a significant impact on the performance of machine learning algorithms, and feedforward networks trained with supervised learning can be viewed as performing representation learning. Traditional methods such as blob analysis and template matching rely on manually designed feature representations, whereas neural networks automatically learn representations suited to the target. Compared with manual feature design, this is more efficient and requires less specialized feature-engineering knowledge; neural networks can therefore identify targets of varying shape, size, and texture across different scenes, and detection accuracy improves further as the dataset grows.
Leveraging these advantages, deep learning algorithms have been widely applied in our company's smart-logistics products. Examples include package segmentation and positioning in visual single-item separation equipment, package contour and attribute recognition in 3D-vision-based unordered grasping workstations, and package recognition and guidance in 3D-vision-based unpacking and palletizing workstations. Our algorithm expert, Li Bo, stated: "The future development of AI will shine on the foundation of deep learning, endowing machines with multi-dimensional perception, autonomous learning, autonomous analysis, and precise execution."