This article summarizes the relevant technologies for visual surface defect detection projects from the perspective of algorithm engineers.
0. Introduction
I've been working on projects in this area for a while now. As an algorithm engineer, I've led several projects, both large and small, some successfully delivered, others ultimately failing. Looking back on the entire project process, despite the ups and downs, it has been very rewarding. I'll consider this my year-end summary for 2023.
This article not only covers technology, but also includes some content related to project management and requirements communication, which can be considered as some of my own insights.
Surface defect detection is often referred to by colleagues as "quality inspection," but I always feel that's too broad a scope. Although the PowerPoint presentations make it seem like we can do everything, in reality, we can't handle some of the more demanding requirements.
This article will not go into too much technical detail; it will mainly focus on methodology.
I. Task Definition
Surface defect detection involves carefully inspecting and evaluating a product's surface to identify and recognize any surface defects that do not meet quality standards or design requirements. The aim of this task is to ensure that the product's appearance and functionality meet predetermined requirements, thereby improving overall product quality and customer satisfaction.
Surface defect detection is widely used in various industries to ensure product quality meets standards and enhances product competitiveness. This includes...
Manufacturing industry:
Automotive manufacturing: Inspecting surface defects in car bodies, parts, etc.
Aviation industry: Inspecting surface defects in aircraft fuselages, parts, etc.
Machining: Inspecting surface defects of machine tool beds, guide rails, and other components.
Electronics industry: Inspecting surface defects in products such as circuit boards and mobile phone casings.
Household appliances: Detecting surface defects in plastic casings, metal parts, etc.
Materials processing industry:
Steel industry: Detecting defects such as cracks and folds on the surface of steel.
Non-ferrous metals industry: Detecting surface defects in metal materials such as aluminum and copper.
Ceramics and Glass Industry: Inspecting surface defects in ceramic and glass products.
Textile and apparel industry:
Inspecting textiles for surface defects such as tears, stains, and uneven coloring.
Inspecting clothing accessories such as buttons and zippers for surface defects.
Food and packaging industry:
Inspecting food packaging bags and containers for print quality and surface defects such as missing prints.
Inspecting packaging materials for surface defects such as poor seals and uneven transparency.
Construction industry:
Detecting defects such as cracks and honeycombing on the surface of concrete structures.
Inspecting building materials such as brick, stone, and wood for surface defects.
Energy sector:
Inspecting surface defects in wind turbine blades, solar panels, etc.
Inspecting oil and gas pipelines for surface defects such as damaged anti-corrosion coatings and flawed welds.
Surface defect detection in these industries is not only related to product quality, but can also affect product safety, durability, and market competitiveness. Therefore, rigorous surface defect detection can ensure the safety and reliability of products during design, production, and use.
I had a conversation with an industry leader about the AOI industry, and my personal feeling is that AOI focuses on the inspection of electronic components:
AOI (Automated Optical Inspection) is a technology that uses optical principles and automation to inspect printed circuit boards (PCBs) or other electronic components for surface defects, dimensional measurements, solder joint checks, and more. An AOI system typically includes components such as a light source, optical lenses, image acquisition equipment, and data analysis software.
II. Needs Communication
As the task definition indicates, these are customized projects, primarily B2B. Some rules are industry-standard, while others are defined by the factory itself.
Initial communication focuses primarily on defect types and their importance; quality standards, inspection standards, and feasibility assessments need further discussion in later rounds.
After identifying the types of defects, it is also necessary to assess the requirements for the algorithm, including real-time performance, accuracy, and detection range. Furthermore, acceptance criteria must be determined during the project initiation phase.
Defect sample collection is also a crucial task. If certain types of defects required by clients are difficult to collect, it's necessary to assess whether some can be artificially created, whether samples can be generated using algorithms, or whether the specific collection time meets project requirements. For example, when real-time requirements are high, the vendor may need to request the client to purchase hardware with better computing power, or a larger number of computing cards. Accuracy is generally included in acceptance criteria, and the definitions of these indicators must be clearly defined.
The detection scope refers to the types of defects to be detected, which ones are allowed to be missed, and which ones require at least one detection. The most common and also the most troublesome is the detection of "foreign objects." If the client defines it as an open set, it needs to be carefully considered because common supervised object detection deep learning methods cannot yet achieve this. If the client defines it as a closed set, then it is necessary to count the number of collectable samples for each type of "foreign object" defect and re-evaluate it according to the new defect category.
It is important to note that even the best detection algorithms for this type of project are limited to detection and do not have the function of repairing defects. This needs to be agreed upon during the initial requirements communication.
In addition, circulating meeting minutes after each meeting and keeping email records are standard practice; they serve as a communication trail for both the client and the vendor's leadership.
III. Imaging Scheme
The selection of an imaging scheme depends on a variety of factors, including the material and surface characteristics of the object being inspected, the required detection accuracy, detection speed, and cost. Here are some common imaging schemes:
Optical microscopy imaging:
Suitable for high-resolution imaging of small-sized defects, such as some circuit board inspections.
Optical inspection and imaging systems (such as CCD/CMOS cameras):
Paired with different types of optical lenses and light sources, these suit a wide range of surface defect inspection tasks, such as steel surfaces and fabric surfaces.
Infrared imaging:
Sensitive to variations in thermal properties, making it suitable for detecting thermal defects in certain materials.
Ultraviolet imaging:
Some surface defects are more visible under ultraviolet light, making UV imaging suitable for specific materials.
Laser scanning imaging:
A laser scanner sweeps the object's surface point by point while sensors collect the data; suitable for high-precision inspection of large objects.
Ultrasound imaging:
Detects defects on and beneath a material's surface via the propagation characteristics of ultrasound in the material.
From an algorithm engineer's perspective, we mostly care about the overall system's cycle time and imaging time (how long the industrial control computer takes to obtain one complete image), whether the final image is single-channel or multi-channel, and whether detection is 2D or 3D.
Two points need to be noted regarding imaging:
A verification process is essential to ensure that every type of defect can be captured and easily distinguished, unaffected by normal areas. Subsequent imaging modifications are very costly, as any change can have far-reaching consequences.
The imaging scheme must ensure portability, that is, the imaging quality achieved under the experimental setup built during the verification phase must be reproduced very closely once the system is officially deployed.
IV. Defect Collection and Annotation
Collecting defective images is a physically demanding task, mainly involving two methods:
Manual collection : This method relies on the client's workers to collect samples, and then either the contractor or the client sends personnel to manually take photos of the samples.
Semi-manual collection combined with automated acquisition: this method is common in defect detection for steel plates and textile fabrics. A key characteristic of such tasks is that each image captured by the camera is spatially aligned, meaning the semantic meaning of each location within the image stays fixed. This property makes it possible to use semi-supervised or unsupervised anomaly detection to sift anomalous samples out of a large volume of collected normal samples. Afterwards, to obtain samples of fixed categories such as scratches and cracks, a CNN classification model still has to be trained to perform fine-grained classification of the defect samples.
An outline of the annotation scheme already takes shape during requirements communication; at this stage the focus shifts to the details. It must be decided how defects are labeled: with boxes or at the pixel level, and whether boxes are axis-aligned or rotated rectangles. This requires weighing customer needs, how precisely defects must be described, and the difficulty of algorithm implementation.
V. Overall Inference Service Framework
Surface defect detection is usually implemented with object detection algorithms and is characterized by large input images and strict real-time requirements.
My personal summary of the framework for surface defect detection algorithms is as follows:
Large image preprocessing : Preprocessing of large images includes removing non-detected regions and classifying specific anomalies. These can generally be handled by manually writing some features (such as removing black borders), but sometimes it is necessary to train some models to handle them (such as classifying specific defects in large images). It is also necessary to extract some detection regions (ROI, Region of Interest).
Defect detection on large images : Some defects are easier to locate and have more obvious features on large images, so they can be detected there. Note, however, that detection on large images consumes more of the overall time budget. Therefore, if subsequent steps already run detection on cropped small images, it is best, from the standpoint of overall system latency, to move detection onto the small images whenever possible.
Small image cropping based on ROI region : If there is an ROI region, crop according to the coordinates of the ROI region. There are two hyperparameters here, namely the size of the small image crop and the stride. These can be determined based on the detection accuracy requirements, system latency requirements, and the input image size of the small image target detector.
Small image preprocessing : Small image preprocessing includes classifying and judging anomalies in small images. In a normal detection process, normal samples account for the majority. A classifier with a relatively low time consumption can block the subsequent detection of targets in small images that are normal in large images.
Object detection on small images : Detection on small images mainly targets small objects that cannot be found reliably on the large image, generally between 7x7 and 30x30 pixels. Small-object detection is challenging, but in industry the fastest way to raise accuracy while keeping inference fast is to add data and add GPUs. As for model selection, naturally pick the most accurate suitable model that fits within the inference latency budget.
Detection result merging : The final output needs to merge the detection results of the small image and the large image, including restoring the detection coordinates of the small image to the position of the large image, and merging the target boxes of the detection results (due to overlap when slicing the small image).
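The cropping and merging steps above can be sketched in a few helper functions. This is a minimal illustration; the tile size, stride, and IoU threshold are assumed values, not fixed choices:

```python
def crop_windows(w, h, tile=640, stride=512):
    """Return (x, y) top-left corners of sliding-window crops covering a w*h ROI."""
    xs = list(range(0, max(w - tile, 0) + 1, stride))
    ys = list(range(0, max(h - tile, 0) + 1, stride))
    if xs[-1] + tile < w: xs.append(w - tile)  # cover the right edge
    if ys[-1] + tile < h: ys.append(h - tile)  # cover the bottom edge
    return [(x, y) for y in ys for x in xs]

def to_large(box, origin):
    """Map an (x1, y1, x2, y2) box detected in a crop back to large-image coords."""
    ox, oy = origin
    x1, y1, x2, y2 = box
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def merge_boxes(dets, iou_thr=0.5):
    """Greedy NMS over (box, score) pairs coming from overlapping crops."""
    dets = sorted(dets, key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in dets:
        if all(iou(box, k) < iou_thr for k, _ in kept):
            kept.append((box, score))
    return kept
```

Duplicate boxes arise precisely because the stride is smaller than the tile size, so a defect near a tile border appears in two crops; after mapping crops back with `to_large`, a single NMS pass removes the duplicates.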
VI. Model Training and Optimization Iteration
The training and tuning of a model mainly includes the following processes:
Data preprocessing : This is the first step in model training, including data collection, data cleaning, handling missing values, data standardization or normalization, feature selection, and feature engineering. The goal is to transform the raw data into a format suitable for model training.
Model selection : Use a small model for small data and a large model for large data. Use a small model for simple tasks and consider a large model or a combination of small models for complex tasks. For example, some small detection models are sufficient for ROI region detection, and a small-scale model is enough to meet the requirements for detecting small targets. There is no need to use transformers or large models all the time, even though they are very popular.
Training the model : Train the model on the labeled dataset. A few parameter adjustments are usually sufficient; I personally set the learning rate to 0.1x its default, load the pre-trained weights, and keep everything else unchanged. When time allows I also experiment with my own modifications, discussed in later sections.
Model evaluation : Evaluate the model's performance using a validation set or test set. Common evaluation metrics include accuracy, recall, F1 score, and mAP.
Model optimization : Adjust the data based on the evaluation results. Note that this refers to the data itself; models already validated in the literature generally should not be modified drastically unless the task is unusual. Low metrics usually indicate a data problem, which the annotation team can fix.
During model training and optimization, I usually write small adapter tools on top of existing frameworks, such as:
Per-category detection metrics
Visualization of dataset annotations
Visualization of model predictions
Conversion of model predictions into pre-annotations, applying pseudo-labels to part of the data
Export of bad cases together with their original annotations, so annotators can make corrections
Optimal classification threshold search
Those above are the most commonly used. I prefer to make tools reusable, so even for simple needs I write scripts and embed them into the model detection and training framework as proper tools.
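The threshold-search tool mentioned above can be as simple as a grid search for the score threshold that maximizes F1 on a validation set. A minimal sketch; the candidate grid is an assumption:

```python
def best_threshold(scores, labels, grid=None):
    """Grid-search the score threshold that maximizes F1 on a validation set.

    scores: per-sample predicted defect scores; labels: 1 = defect, 0 = normal.
    """
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # assumed candidate thresholds
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

In practice the same loop can optimize whatever metric the acceptance criteria specify, e.g. recall at a fixed false-positive rate instead of F1.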
VII. Model Deployment
Model deployment is the process of transforming a trained model into a practical, usable service. Model deployment includes the following main tasks:
Environment configuration : Prepare a suitable environment for model deployment, including hardware resources (such as CPU/GPU, memory, storage, etc.), operating system, dependency libraries and frameworks, etc. Generally, these are provided by the vendor, and the algorithm engineer can package all services into Docker.
Model conversion : Converting a trained model into a deployment-friendly format. This may include converting the model to a specific format (most commonly ONNX) or optimizing the model's code to improve inference speed.
Service building : Integrating the model into a server or application so that it can be invoked remotely. This typically involves writing API interface code and creating the corresponding service architecture, such as microservices, RESTful APIs, etc.
Performance optimization : Ensure the model runs efficiently after deployment. This may include compressing or quantizing the model, or optimizing the service architecture to reduce latency and increase throughput.
Testing : Conduct comprehensive testing before and after deployment to ensure the model's functionality and performance meet expectations. This typically requires having a testing colleague conduct the tests and provide a test report.
Monitoring : Deploy a monitoring system to track model performance, including metrics such as accuracy, response time, and resource consumption. Load testing of the deployed system's accuracy and resource consumption is essential, as it relates to service availability.
Logging and Error Handling : Configure a logging system to track and analyze problems that arise in the model. Simultaneously, implement an error handling mechanism to provide appropriate feedback or solutions when anomalies occur.
Documentation and Training : Provide necessary documentation and training to those who use the model to ensure they understand how to use the model and services correctly.
Security and compliance : Ensure that the deployment of the model complies with regulatory requirements for data security and privacy protection, including data encryption, model encryption, user authentication, and access control.
The optimization of the service deployment architecture is reflected in the following principles:
Make full use of the accelerator card's GPU/NPU resources . For example, use hardware decoding for input videos or images wherever possible; the decoded output then already resides in device memory, so subsequent preprocessing need not round-trip through the CPU. Input preprocessing should likewise run on the accelerator card whenever possible.
Model parallelism strategy : For example, split the large image into smaller segments and run inference on different computing cards to fully utilize multi-card resources. This involves load-balancing techniques: how to distribute computational workloads sensibly across different computing devices. My personal blog contains an article titled "Thoughts on the Allocation of Computational Resources in Single-Machine Multi-GPU AI Model Inference Scenarios," which you are welcome to read.
Service Parallelism Strategy : Run multiple inference services using nginx proxy and provide a single external interface to improve service robustness and concurrency.
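The nginx setup described above can be sketched as a config fragment: several identical inference service replicas behind one upstream, exposed through a single port. Ports, paths, and timeouts here are assumptions for illustration:

```nginx
upstream infer_backends {
    least_conn;                     # route each request to the least-busy replica
    server 127.0.0.1:8001;          # inference service replica 1
    server 127.0.0.1:8002;          # inference service replica 2
}

server {
    listen 8000;                    # the single external interface
    location /predict {
        proxy_pass http://infer_backends;
        proxy_read_timeout 30s;     # tolerate slow large-image inference
    }
}
```

If one replica crashes, nginx marks it failed and keeps routing to the survivors, which is where the robustness gain comes from.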
VIII. Some Tricks to Reduce Inference Latency
Image input size
The image input does not necessarily have to be square . It can be resized while preserving the aspect ratio. For example, if the original image has a 4:1 aspect ratio, a classification or detection model can be trained with a 4:1 input as well. This meets the accuracy requirements while greatly reducing inference time.
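A small helper can derive the model input size from the original aspect ratio instead of forcing a square. A sketch under assumed defaults; the target height and the alignment multiple are illustrative (many backbones want dimensions divisible by their total stride, commonly 32):

```python
def input_size(orig_w, orig_h, target_h=256, multiple=32):
    """Compute a model input size that keeps the original aspect ratio,
    rounding each side up to the nearest stride-aligned multiple."""
    scale = target_h / orig_h
    w = int(round(orig_w * scale))
    align = lambda v: max(multiple, ((v + multiple - 1) // multiple) * multiple)
    return align(w), align(target_h)

# A 4096x1024 (4:1) image keeps a 4:1 input instead of being squashed square.
print(input_size(4096, 1024))
```

Compared with padding a 4:1 image into a square, this avoids running the network over rows that are pure padding, which is where the latency saving comes from.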
Input channel size
Since grayscale images have only one channel, there's no need to duplicate the input channel to meet model requirements . We can modify the model to have only one input channel without affecting pre-training. Please refer to my other blog post, "The Convolutional Kernel Compression Trick for Grayscale Image Classification Using ImageNet Pre-training," for more details.
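The idea can be sketched with plain numpy: when a grayscale image is replicated across 3 channels, convolving with the pre-trained RGB kernels is mathematically equivalent to convolving the single channel with the kernels summed over the input-channel axis, so the pre-trained weights collapse losslessly. This is a sketch of the principle, not any specific framework's API:

```python
import numpy as np

rng = np.random.default_rng(0)
w_rgb = rng.standard_normal((8, 3, 3, 3))   # (out_ch, in_ch=3, kH, kW): pre-trained first conv
w_gray = w_rgb.sum(axis=1, keepdims=True)   # (out_ch, 1, kH, kW): compressed kernel

gray = rng.standard_normal((1, 16, 16))     # single-channel input patch
rgb = np.repeat(gray, 3, axis=0)            # the channel-replicated version

def conv_valid(x, w):
    """Naive valid cross-correlation: x (C,H,W), w (O,C,kH,kW) -> (O,H',W')."""
    o, c, kh, kw = w.shape
    _, h, wd = x.shape
    out = np.zeros((o, h - kh + 1, wd - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + kh, j:j + kw]
            out[:, i, j] = (w * patch).sum(axis=(1, 2, 3))
    return out

# Both paths produce identical activations (up to float rounding),
# so the 1-channel model loses nothing versus replicating the input.
assert np.allclose(conv_valid(rgb, w_rgb), conv_valid(gray, w_gray))
```

The saving is the first layer's multiply-accumulate count dropping to a third, plus skipping the channel-replication copy on every frame.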
Model quantization
8-bit quantization is now a mature scheme for detection and classification models. If the optimizations above still cannot meet the system's latency requirements, model quantization can be considered.
IX. Subsequent Operation and Maintenance
Surface defect detection is an important quality control task in the manufacturing industry, and it is necessary to maintain the software service during the service period.
Subsequent maintenance work mainly includes the following aspects:
Data management and analysis : Collecting and storing test data, and conducting regular analysis to optimize testing processes and improve efficiency. This may involve using specialized data analysis software to statistically analyze the test results.
Delivery personnel training : Provide training to delivery personnel, including model upgrade strategies, bad case data collection, and model service deployment.
Troubleshooting and Feedback : If any issues are found during the testing process, prompt action is required to identify the cause and resolve the problem. Simultaneously, the issue and its solution should be documented to prevent similar problems from recurring in the future.