
What is machine vision? What are its applications?

2026-04-06

When discussing robot vision, computer vision and machine vision inevitably come to mind, and many people confuse the three.

Computer vision is the science of understanding images: it takes images as input and outputs recognition results. A representative company in this field is Google.

Machine vision is mostly used for quality inspection on production lines. It is generally based on 2D recognition and is widely used in the 3C electronics industry. Cognex is a representative company in this field.

Robot vision refers not only to using visual information as input, but also to processing this information and extracting useful information for the robot. The goal is to truly transform the robot into a "robot," rather than just a robotic arm.

(I)

Traditional robotic arms are merely automated devices that handle fixed actions through programming; they cannot process dynamic objects. Robot vision, on the other hand, requires robots to possess 3D vision, enabling them to handle three-dimensional objects in space and employing complex algorithms to capture intricate information such as position, movement, and trajectory. This necessitates the use of artificial intelligence and deep learning.

Robot vision serves cognitive robots, and the ability to continuously learn is particularly crucial. Whether it's detection or localization guidance, the more times a robot performs these tasks, the higher its accuracy will be as the data grows and changes. This is similar to the learning and growth ability of humans.

Robot vision is a research method for solving problems. After a long period of development, various methods have emerged in areas such as localization, recognition, and detection. It uses common cameras as tools and images as the processing medium to acquire environmental information.

1. Camera model

Cameras are the primary tool of robot vision and the medium through which robots perceive their environment. The mathematical model of a camera is the pinhole model, whose core is solving similar triangles. Three points are worth noting:

1.1 1/f = 1/a + 1/b

The reciprocal of the focal length equals the sum of the reciprocals of the object distance a and the image distance b. This is the thin-lens imaging equation; a sharp image forms only when it is satisfied.

1.2 X = x·f/Z

If the focal length f is changed continuously while the camera is moved along the Z-axis, the number of pixels X occupied by an object of size x in the image can remain constant. This is the dolly-zoom principle. If a second object sits behind the first (at larger Z), the principle can be used to adjust the size ratio of the two objects in the photograph at will.

1.3 Focal length and field of view

A longer focal length gives a smaller field of view, letting you capture distant objects sharply; it also produces a shallower depth of field.
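The dolly-zoom relation X = x·f/Z above can be sketched in a few lines of plain Python (the focal lengths and distances here are hypothetical numbers, chosen only to illustrate the effect):

```python
# Dolly zoom sketch: under the pinhole model the projected size is X = x * f / Z,
# so an object keeps the same size on the sensor as long as f/Z stays constant.
def image_size(x, f, Z):
    """Projected size (in pixels) of an object of size x at distance Z, focal length f."""
    return x * f / Z

# Move the camera back (Z: 10 -> 20) while doubling f: the foreground size is unchanged.
assert image_size(2.0, 50, 10) == image_size(2.0, 100, 20)

# A background object (larger Z) grows relative to the foreground:
far_before = image_size(2.0, 50, 40)   # background before the move
far_after = image_size(2.0, 100, 50)   # same object after moving back 10 units
```

Because the background's Z changes proportionally less than the foreground's, its on-screen size grows while the foreground stays fixed, which is exactly the unsettling dolly-zoom look.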

2. Vanishing point

The vanishing point is peculiar to images: it corresponds to no finite point in the scene. Due to projective transformation, lines that are parallel in reality tend to intersect in a photograph. If we find the intersection point of the imaged parallel lines, it corresponds to a point at infinity in reality, yet its image coordinates are a finite [x1 y1 1]ᵀ. This point is called the vanishing point. The ray from the camera's optical center through the vanishing point points, in the camera coordinate system, in the direction of the parallel lines themselves.
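Finding a vanishing point is a direct application of homogeneous coordinates: the line through two points is their cross product, and the intersection of two lines is again a cross product. A minimal numpy sketch with hypothetical pixel coordinates:

```python
import numpy as np

# Vanishing-point sketch: in homogeneous coordinates the line through two points
# is l = p1 x p2, and the intersection of two lines is p = l1 x l2 (point-line
# duality). Two images of parallel 3D lines meet at the vanishing point.
def line_through(p1, p2):
    return np.cross(p1, p2)

def intersection(l1, l2):
    p = np.cross(l1, l2)
    return p / p[2]          # normalize the homogeneous scale (finite point assumed)

# Two imaged segments that converge toward the same point (hypothetical data):
l1 = line_through(np.array([0.0, 0.0, 1.0]), np.array([2.0, 1.0, 1.0]))
l2 = line_through(np.array([0.0, 2.0, 1.0]), np.array([2.0, 1.5, 1.0]))
v = intersection(l1, l2)     # vanishing point in homogeneous image coordinates
```

With these numbers the two lines y = x/2 and y = 2 − x/4 meet at (8/3, 4/3), and `v` recovers exactly that point.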


Furthermore, the vanishing points of all directions lying in the same plane fall on a straight line in the image, called the horizon line. This principle can be used to measure the height of a person standing on the ground. Note that the horizon line sits at the camera's height in the image only when the camera is held level.


2.1 Pose Estimation

If we can obtain two vanishing points in an image, and the directions corresponding to these vanishing points are perpendicular to each other (as with a grid), then we can estimate the camera's pose relative to the target (target pose estimation). Each vanishing point v gives a direction in the camera frame, d = K⁻¹v normalized, so with the internal matrix K known, the camera's rotation relative to the target follows as R = [d1 d2 d1×d2]. Once the rotation is known, the distance between the camera and the target can be calculated from the projective transformation, and the robot's position estimated.
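A minimal numpy sketch of pose from two orthogonal vanishing points; the intrinsics K and the two vanishing points below are hypothetical values chosen so the corresponding grid directions really are perpendicular:

```python
import numpy as np

# Pose-from-vanishing-points sketch: the ray toward a vanishing point v is
# d = K^-1 v, normalized. Two vanishing points of perpendicular grid directions
# give two columns of the rotation; the third column is their cross product.
def rotation_from_vps(K, v1, v2):
    d1 = np.linalg.solve(K, v1)
    d1 /= np.linalg.norm(d1)
    d2 = np.linalg.solve(K, v2)
    d2 /= np.linalg.norm(d2)
    return np.column_stack([d1, d2, np.cross(d1, d2)])

K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])   # hypothetical camera
v1 = np.array([1120.0, 240.0, 1.0])                            # ray (1, 0, 1)/sqrt(2)
v2 = np.array([-480.0, 240.0, 1.0])                            # ray (-1, 0, 1)/sqrt(2)
R = rotation_from_vps(K, v1, v2)
# R is orthonormal exactly when the two grid directions are truly perpendicular.
```

If the two measured vanishing directions are not exactly perpendicular, the resulting matrix is only approximately a rotation and should be re-orthogonalized (e.g. by SVD).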


2.2 Point-line duality

p1 × p2 = l12 (the line through p1 and p2);  l12 × l23 = p2 (the intersection point of the two lines)

3. Projective transformation

Projective transformation is a transformation from one plane to another in space. Any invertible matrix H, applied to homogeneous coordinates, represents a projective transformation. In short, it can be expressed as A = HB, where A and B are homogeneous coordinates of the form [X Y 1]ᵀ. A major application of projective transformation is projecting one shape into another: for example, inserting virtual billboards into photographs during sports broadcasts, or the flag graphic overlaid on the pool when a swimmer finishes. Projective transformation is also fundamental to augmented reality.


The core of projective transformation lies in determining H. Common solution methods can be found in machine vision textbooks.

Suppose the four corners of a planar target are A(0, 0, 1), B(0, 1, 1), C(1, 1, 1), and D(1, 0, 1) in homogeneous world coordinates, and that these four points project to four pixel locations we already know.

From the pixel positions we can also compute two vanishing points, V1(x1, y1, z1) and V2(x2, y2, z2), both expressed in image coordinates. Their corresponding world directions are (0, 1, 0) and (1, 0, 0). Together with A(0, 0, 1), this gives three world coordinates: (1, 0, 0), (0, 1, 0), and (0, 0, 1), which happen to form an identity matrix. Applying the projective transformation H to these three columns yields pixel coordinates; since those pixel coordinates are known, the column of H matching direction (1, 0, 0) must be beta·V2, the column matching (0, 1, 0) must be alpha·V1, and the third column must be gamma·[pixel coordinates of A], where alpha, beta, and gamma are constants (homogeneous coordinates are defined only up to scale).

If we can solve for alpha, beta, and gamma, we obtain the projective transformation matrix. Substituting the pixel coordinates of point C gives 3 equations in 4 unknowns (C introduces its own scale lambda). The lambda does not affect the result: dividing through by it, and treating alpha/lambda, beta/lambda, and gamma/lambda as the unknowns, resolves the projective matrix.

Therefore, the first two columns of the projective transformation matrix are (up to scale) the two vanishing points, and the cross product of the first and second columns gives the equation of the horizon line (point-line duality).
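The same four-point setup can also be solved directly by the DLT (direct linear transform), without reasoning about vanishing points. A minimal numpy sketch with a hypothetical quadrilateral:

```python
import numpy as np

# DLT sketch: each correspondence x' ~ H x gives two linear equations in the
# 9 entries of H; stacking 4 points gives A h = 0, solved by the right singular
# vector of A with the smallest singular value.
def homography_dlt(src, dst):
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.array(A))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

# Map the unit square onto an arbitrary quadrilateral (hypothetical pixel coords):
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(10, 10), (200, 30), (220, 180), (5, 190)]
H = homography_dlt(src, dst)
p = H @ np.array([0.0, 0.0, 1.0])   # p / p[2] reproduces the first target corner
```

Four correspondences pin down H exactly; with more (noisy) points, the same SVD gives the least-squares solution.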

(II)

Last time, we introduced some basic information about robot vision, mentioning that the core task of robot vision is estimation, and the theoretical framework is projective geometry. However, the primary condition for estimation is knowing the pixel coordinates, especially the pixel coordinates of corresponding points in multiple images. We won't go into detail about single-image processing methods, but rather discuss invariant point detection and invariant features. Because the robot is constantly moving, it may capture images of the same object from different directions. The shooting distances and angles vary. Due to the nature of projective transformations, it's impossible to guarantee that the object in two images will look identical. Therefore, we need a feature extraction method (feature point detection) that ensures the detection is rotation- and scaling-invariant. In addition, we need a feature description method that is also rotation- and scaling-invariant.

1. SIFT Feature Extraction

SIFT feature extraction can be divided into the following steps: (1) multi-scale convolution; (2) pyramid construction; (3) 3D non-maximum suppression.

Multi-scale convolution simulates viewing the image from near to far; the pyramid is built by downsampling.



For the same pixel across scales, we can track how its grayscale response changes. If a point responds differently to templates with different sigma values, the scale giving the maximum response (the grayscale after convolution) becomes the intrinsic scale of that point. This is somewhat like applying excitations of different frequencies to a mechanical structure: resonance occurs at a certain frequency, and that frequency can be recorded as a signature of the structure (the frequency of a simple pendulum depends only on its length l; given the frequency, the system can be reproduced). Therefore, as long as we find a suitable template (excitation) and its maximum response, we obtain the intrinsic scale of each point in the image. The same object, photographed at different distances, responds consistently at its intrinsic scale, which solves scale invariance.

3D non-maximum suppression keeps only the maximum response within a 3×3×3 neighborhood (over space and scale) as a feature point. Since such a point has the strongest response in its spatial neighborhood in every direction, it is also rotation-invariant.

2. SIFT Feature Description

Feature extraction and feature description are actually two different things. Feature extraction was covered in the previous section. If there are two images, then identical feature points will definitely be found. The role of feature description is to prepare for matching; it uses the local region information of feature points as a standard to link identical feature points in two images. A feature is essentially a high-dimensional vector. It must be scale-invariant and rotation-invariant.

The HOG feature is used here. Feature description can be divided into two steps: (1) determining the local principal direction; (2) calculating the gradient histogram. Using sigma as the feature description range is a reasonable idea because sigma describes the scale, and the feature point position + scale = the local information represented by the feature point. Based on this, the gradient directions of all pixels in its neighborhood are counted, and the direction histogram is used as the feature vector, thus completing the HOG feature construction. Importantly, before counting the directions, the image principal direction and the X-axis direction need to be aligned. The schematic diagram is as follows:


In the image, the yellow, clock-like markers represent feature points plus scale, and the pointer indicates the principal orientation (from PCA) of the local patch. The green elements are the bins of the histogram used to build the feature vectors. Finally, by matching feature vectors, we obtain the corresponding point pairs of Image 1 and Image 2. The two images can then be stitched together using the homography matrix; if calibration information is known, 3D reconstruction can be performed.

(III)

The previous article discussed extracting feature points from the scene and matching feature points across different views. This time, we first introduce a tool: fitting. Fitting is essentially an optimization problem, and the most basic optimization method is linear least squares. In other words, we need to minimize the fitting error.

1. Least squares fitting

The basic least squares fitting method solves the problem of fitting a point to a model. Taking the fitting of a point to a line as an example, according to the modeling of the fitting error, this problem can be divided into two categories.





The first type of problem uses the error of the dependent variable as the optimization objective. These problems often involve an independent variable-dependent variable model, where the units of x and y are different. The second type uses distance as the optimization objective. In these problems, the units of x and y are often the same, and the straight line does not represent a trend but rather a geometric model. Because the optimization objectives are different, the modeling methods and solutions are also different, but the solution approach is the same: both involve finding the magnitude of a vector. Since a vector is the result of matrix operations, the problem ultimately becomes a singular value decomposition problem.
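The two error models can be put side by side in a short numpy sketch (the point set is hypothetical, chosen to lie exactly on y = 2x + 1):

```python
import numpy as np

# Line-fitting sketch contrasting the two error models from the text:
# (1) minimize the vertical (dependent-variable) error: solve y = m x + c
#     by linear least squares;
# (2) minimize the perpendicular distance (total least squares): the right
#     singular vector of the centered data with the smallest singular value
#     is the line's unit normal.
def fit_vertical(pts):
    x, y = pts[:, 0], pts[:, 1]
    A = np.column_stack([x, np.ones_like(x)])
    m, c = np.linalg.lstsq(A, y, rcond=None)[0]
    return m, c

def fit_perpendicular(pts):
    centered = pts - pts.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered)
    n = Vt[-1]                      # unit normal (a, b) of the line a x + b y = d
    d = n @ pts.mean(axis=0)
    return n, d

pts = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]])  # exactly y = 2x + 1
m, c = fit_vertical(pts)
```

On noiseless data both fits agree; on noisy data with equal units for x and y, the perpendicular (SVD) version is the geometrically meaningful one, as the text argues.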

2. RANSAC Fitting

The Random Sample Consensus (RANSAC) algorithm is a classic data-processing algorithm used to extract a specific model from a sample under heavy noise. The image below illustrates the effect of the RANSAC algorithm: some points clearly satisfy a certain straight line, while another cluster of points is pure noise. The goal is to find the equation of the line under heavy noise, where the noise points outnumber the line points three to one.


The least squares method cannot achieve this effect; the line would be slightly higher than the line in the graph. The principle of the random sampling consensus algorithm is explained clearly on Wikipedia, even providing pseudocode, MATLAB, and C code. I'd like to explain this algorithm in a less formal or less academic way. Essentially, this algorithm selects the most desirable data from a set of data. "Desired" naturally has a standard (the form of the target: satisfying the equation of a straight line? satisfying the equation of a circle? and the tolerable error e). Determining a straight line in a plane requires 2 points, while determining a circle requires 3 points. The random sampling algorithm is actually quite similar to a young girl choosing a boyfriend.

Pick a random guy from the crowd, assess his qualities, and then start a relationship with him. (In a plane, randomly select two points, fit a straight line, and calculate how many points within the tolerance error e satisfy this straight line.)

The next day, find another guy, see what his conditions are like, compare him to your boyfriend, and if he's better, switch to the new one (randomly select two points again, fit a straight line, see if this line can tolerate more points, and if so, record this line as the result).

On the third day, repeat the behavior from the second day (iterative loop).

Finally, on reaching a certain age, she marries her current boyfriend (iteration complete; the current result is recorded).

Obviously, if a girl uses the above method to find a boyfriend, she will eventually marry a good one (we will get the desired segmentation result). As long as the model intuitively exists, the algorithm will always have a chance to find it. Its advantage is that the noise can be arbitrarily distributed, and the noise can be much larger than the model information. This algorithm has two disadvantages: first, a suitable tolerance error e must be specified; second, the number of iterations must be specified as a convergence condition. Considering these characteristics, this algorithm is very suitable for detecting objects with special shapes from cluttered point clouds.
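The steps above translate almost line-for-line into code. A minimal numpy sketch with synthetic data (20 points on y = x plus three times as many uniform noise points, matching the scenario described):

```python
import numpy as np

# RANSAC sketch following the steps above: repeatedly fit a line to 2 random
# points, count the inliers within tolerance e, and keep the best hypothesis.
def ransac_line(pts, e=0.1, iters=200, rng=np.random.default_rng(0)):
    best_inliers, best_line = 0, None
    for _ in range(iters):
        p1, p2 = pts[rng.choice(len(pts), 2, replace=False)]
        d = p2 - p1
        norm = np.linalg.norm(d)
        if norm < 1e-12:
            continue                                 # degenerate sample
        n = np.array([-d[1], d[0]]) / norm           # unit normal of the line
        dist = np.abs((pts - p1) @ n)                # perpendicular distances
        inliers = np.sum(dist < e)
        if inliers > best_inliers:                   # better boyfriend: switch
            best_inliers, best_line = inliers, (p1, n)
    return best_line, best_inliers

rng = np.random.default_rng(1)
line_pts = np.column_stack([np.linspace(0, 10, 20)] * 2)   # 20 points on y = x
noise = rng.uniform(0, 10, size=(60, 2))                   # 3x as much noise
line, n_in = ransac_line(np.vstack([line_pts, noise]))
```

With 200 iterations the chance of never sampling two line points is negligible, so the best hypothesis ends up supported by at least the 20 structured points.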

3. Nonlinear fitting

Linear least squares has a good explanation. However, life is often challenging; problems that can be transformed into the standard matrix form described above are relatively few. Most of the time, we're not dealing with min(||Ax - b||), but rather min(||f(x) - b||)!!!


In 3D reconstruction, if we have three or more viewpoints, the viewing rays are likely not to intersect at a single point, because the chosen rotation matrices have limited precision and pose estimation carries its own errors. Singular value decomposition (SVD) can find the point with minimum total distance to the rays. A better estimation method minimizes the reprojection error of this point on the cameras: R, T, and P(X, Y, Z) are estimated simultaneously so that the reprojection error is minimal — the state of the art. Let's return to the original question: how do we solve nonlinear least squares?


From the linear least squares method, we can obtain the matrix expression for the nonlinear least squares method. If we want to find its local minimum, the derivative with respect to x should be 0.


However, this is not easy to solve, so we consider using an iterative gradient descent approach. Here, we are using a simple gradient.


There is a potentially confusing step here: it assumes that Δx is very small, which yields the form above and guarantees f(x + Δx) < f(x). Iterating on x then ensures each step moves in the direction in which f decreases. Strictly speaking, the full second-order step would be given by the Hessian matrix.
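The iterative scheme can be sketched with a numerically estimated Jacobian; the beacon example below anticipates the localization problem mentioned next (beacon positions and the starting guess are hypothetical):

```python
import numpy as np

# Gradient-descent sketch for min ||f(x) - b||^2: step along -J^T r, where r is
# the residual and J the Jacobian (here estimated by forward differences).
# A small enough step guarantees f decreases, as argued above; Gauss-Newton
# would instead solve (J^T J) dx = -J^T r for a second-order step.
def solve_nlls(f, x0, b, alpha=0.1, iters=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = f(x) - b
        J = np.empty((len(r), len(x)))
        for j in range(len(x)):                  # forward-difference Jacobian
            dx = np.zeros_like(x)
            dx[j] = 1e-6
            J[:, j] = (f(x + dx) - f(x)) / 1e-6
        x = x - alpha * (J.T @ r)                # gradient step
    return x

# Toy range-only localization: recover a 2D position from distances to beacons.
beacons = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
truth = np.array([3.0, 4.0])
ranges = np.linalg.norm(beacons - truth, axis=1)
est = solve_nlls(lambda p: np.linalg.norm(beacons - p, axis=1), [1.0, 1.0], ranges)
```

With noiseless ranges and this starting point, the iteration settles on the true position; real solvers replace the fixed step with a line search or a Levenberg-Marquardt damping term.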


Take beacon localization as an example. Logically, drawing circles centered on two beacons should give two analytical solutions for the location. With many beacons, however, the noisy circles no longer meet at a single point and only bound a region, so the position must be estimated. This is a classic problem in SLAM, and a later blog post will specifically discuss bundle adjustment.


(IV)

Epipolar geometry is the most important concept in binocular vision, a branch of robot vision. Unlike structured-light vision, binocular vision is a "passive measurement" method.

1. Prerequisites for studying epipolar geometry

Epipolar geometry studies two images with overlapping regions. The goal is to extract the relationship between the two camera poses. Once that relationship is obtained, we can perform 3D reconstruction of the scene points.

Epipolar geometry defines four physical quantities: 1. epipoles; 2. epipolar lines; 3. the fundamental matrix; 4. the essential matrix — defined as shown in the left figure. The quantities it relates are also four: the C1 coordinates, the C2 coordinates, R, and T, defined as shown in the right figure.



An epipole is the projection of the other camera's optical center onto this image. An epipolar line is the projection of a viewing ray of the other camera onto this image. (Both epipoles and epipolar lines lie on the image.)

1.1 Essential matrix

The essential matrix carries the relative pose information of the cameras. Its derivation: in the coordinate system of camera 2, a scene point has coordinates X2 = R·X1 + t, and the optical center of camera 1 has coordinates t. The epipolar plane is spanned by X2 − t = R·X1 and t, so the three vectors X2, t, and R·X1 are coplanar, giving X2ᵀ [t]× R X1 = 0, where [t]× is the cross-product matrix of t. E = [t]× R is called the essential matrix. Once both images are captured, R and t are fixed, so every pair of corresponding scene points must satisfy the essential-matrix constraint!


1.2 Fundamental matrix

If a point satisfies the E-matrix constraint in normalized camera coordinates, its pixel coordinates are related through the camera's internal matrix: x1 = K·X1 and x2 = K·X2, where x1 and x2 are homogeneous pixel coordinates, so X1 = K⁻¹x1 and X2 = K⁻¹x2. Substituting into the essential-matrix constraint gives x2ᵀ K⁻ᵀ [t]× R K⁻¹ x1 = 0, i.e. x2ᵀ F x1 = 0 with F = K⁻ᵀ [t]× R K⁻¹ = K⁻ᵀ E K⁻¹, called the fundamental matrix. The fundamental matrix accepts homogeneous pixel coordinates. Its rank is 2 because it has a null space (the epipole). Since homogeneous quantities are defined only up to scale, F has 8 unknowns; each pair of image points provides one equation, so F can be solved linearly from 8 point pairs. In practice the problem is reduced to Ax = 0 and solved by singular value decomposition (the last column of V); a second SVD then zeroes the smallest singular value of F to enforce the rank-2 constraint.
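The eight-point recipe fits in a few lines of numpy. The sketch below verifies itself on synthetic data (random 3D points seen by two hypothetical cameras with K = I and a pure translation between them):

```python
import numpy as np

# Eight-point sketch: each correspondence gives one equation x2^T F x1 = 0 in
# the 9 entries of F; stack them as A f = 0, take the smallest right singular
# vector, then enforce rank 2 by zeroing the smallest singular value of F.
def eight_point(x1, x2):                     # x1, x2: (N, 3) homogeneous points
    A = np.array([np.kron(p2, p1) for p1, p2 in zip(x1, x2)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0                               # rank-2 enforcement (second SVD)
    return U @ np.diag(S) @ Vt

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(-1, 1, (10, 2)), rng.uniform(4, 8, 10)])  # 3D pts
R, t = np.eye(3), np.array([1.0, 0.0, 0.0])  # hypothetical relative pose
x1 = X / X[:, 2:]                            # camera 1 at origin, K = I
X2 = X @ R.T + t
x2 = X2 / X2[:, 2:]
F = eight_point(x1, x2)
err = np.abs(np.einsum('ni,ij,nj->n', x2, F, x1))   # epipolar residuals
```

On exact data the residuals vanish to machine precision; with real detections, the normalized eight-point variant (rescaling the coordinates first) is the numerically sound choice.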

1.3 Epipoles and epipolar lines

From the fundamental matrix we know that x2ᵀ F x1 = 0. Clearly, a line is hiding here: by point-line duality, x2 lies on the line F x1 — the epipolar line in image 2 — and x1 lies on the line Fᵀ x2 — the epipolar line in image 1. An epipole is the common intersection of the epipolar lines (at least two are needed to locate it).

2. Recovering R and T from the essential matrix

E = [t]× R = [ [t]× r1  [t]× r2  [t]× r3 ]

The rank of E is 2 because it has a null space. Also, since r1, r2, r3 are orthonormal, their products with [t]× retain part of the structure of a rotation matrix — in particular, the two nonzero singular values of E are equal. From tᵀE = 0, we know that after the singular value decomposition of E, t is u(:, end), the column of U corresponding to the smallest singular value. As follows:




Here we assume R = U·Y·Vᵀ. Since U and V come from the SVD of E and R is a rotation, there must be a matrix Y that makes the equation hold: the columns of V are mutually perpendicular, R acts as a rotation, so the columns of U are necessarily mutually perpendicular as well. Therefore R must have a solution. Introducing the intermediate variable Y (a 90° rotation about the z-axis), it is easy to solve:


In summary, there are four possible solutions, corresponding to the four cases below, of which only the first — the one placing the reconstructed points in front of both cameras — is physically possible. If det(R) = 1, the sign guess is correct; if det(R) = −1, take t = −t and R = −R.
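The four-candidate decomposition can be sketched as follows; the round-trip check uses a hypothetical pose (identity rotation, unit translation) and verifies that one candidate reproduces E up to the unavoidable scale and sign ambiguity:

```python
import numpy as np

# Decomposition sketch: from E = U diag(1,1,0) V^T, the two rotation candidates
# are U W V^T and U W^T V^T, with W a 90-degree rotation about z, and t is the
# last column of U (up to sign) -- four (R, t) combinations in total, of which
# only the one placing the scene in front of both cameras is physical.
def decompose_essential(E):
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:
        U = -U                               # keep proper rotations
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1, 0], [1, 0, 0], [0, 0, 1]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

def skew(t):
    return np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]])

R_true, t_true = np.eye(3), np.array([1.0, 0.0, 0.0])   # hypothetical pose
E = skew(t_true) @ R_true
candidates = decompose_essential(E)
```

In practice, the one physical candidate is selected by triangulating a point with each (R, t) and keeping the combination that gives positive depth in both cameras (the cheirality check).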


3. Recovering three-dimensional coordinates from spatial relationships

Given the calibration information and the positional relationship between the two cameras, the projection matrices P of the two cameras are known. For a point X1 in space, the relationship x1 = P·X1 holds, hence [x1]× P X1 = 0. We again have the magical form Ax = 0, which singular value decomposition solves.
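The triangulation step can be sketched directly; the two projection matrices and the scene point below are hypothetical, set up so the reconstruction can be checked against ground truth:

```python
import numpy as np

# Triangulation sketch: x ~ P X means [x]_x P X = 0; each view contributes two
# independent rows, giving A X = 0 -- the "magical form" above -- solved by SVD.
def triangulate(P_list, x_list):
    A = []
    for P, (u, v) in zip(P_list, x_list):
        A.append(u * P[2] - P[0])            # rows of the cross-product constraint
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.array(A))
    X = Vt[-1]
    return X[:3] / X[3]                      # dehomogenize

P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                # camera 1 at origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])    # translated camera
X_true = np.array([0.5, -0.2, 4.0])                          # hypothetical point
xs = []
for P in (P1, P2):
    x = P @ np.append(X_true, 1.0)
    xs.append(x[:2] / x[2])                  # observed pixel coordinates
X_est = triangulate([P1, P2], xs)
```

The same function accepts any number of views; with noisy observations the SVD returns the algebraic least-squares point rather than an exact intersection.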

4. Find the F matrix using RANSAC.

With eight corresponding points, we can obtain the F matrix. Adding K, we can then perform 3D reconstruction of the two images. However, automatically obtaining these eight corresponding points is challenging. The SIFT algorithm offers a possibility for automatic matching; however, the matching results still contain many mismatches. This section aims to use RANSAC as the algorithmic basis and the fundamental matrix as the method to judge the matching results. First, due to factors such as detection errors, pixels cannot perfectly satisfy the fundamental equation. Therefore, there will be a certain distance between the point and the epipolar line. We use the perpendicular distance to model this, with the following expression:

Here (F x1)₁ and (F x1)₂ denote the first two components of the epipolar line F x1, which normalize the point-to-line distance. A point is considered to satisfy the F equation as long as this error is below a threshold. The algorithm flow: 1. randomly select 8 point pairs; 2. estimate F; 3. compute e for all points and count #inliers; 4. repeat steps 1–3, updating F_candidate whenever #inliers increases; 5. after many iterations, F_candidate is the estimate of F. The RANSAC algorithm once again demonstrates its excellent noise tolerance.

(V)

As mentioned earlier, the core of robot vision is estimation. Feature extraction and registration are preparations for estimation. Once registration is complete, we can estimate the robot's position and pose from the image; with position and pose, we can stitch together the 3D reconstructed data. The problem of estimating robot pose from visual information can be divided into three main categories: 1. scene points on a common plane; 2. scene points in general 3D space; 3. registration of two point clouds. All of these share one prerequisite: the camera's internal matrix K is known.

1. Pose estimation from homography matrix

Homography matrix originally refers to the mapping relationship from R2 to R2.

However, in estimation problems, if we can obtain this mapping, we can recover the transformation from the world coordinate system x_w to the camera coordinate system x_c, which expresses the camera's pose relative to the world frame:

H = s·K·[r1 r2 t] —— assuming the z-coordinate on the plane is 0
s·[r1 r2 t] = K⁻¹·H —— recovering the rotation and translation columns from the homography
r3 = r1 × r2 —— recovering r3

The scale s is not important; it is fixed by normalizing K⁻¹h1 to unit length. The key question, then, is how to obtain the homography between the two planes. Earlier I mentioned obtaining it from vanishing points, but if the mapping is not rectangle-to-quadrilateral, there is no vanishing point to find. Here is an exceptionally elegant method based on matrix transformation and singular value decomposition. JB Shi truly deserves his reputation as a master: he explains this problem in just a few sentences.


Since the H matrix has 8 degrees of freedom and each point correspondence provides two equations, 4 correspondences uniquely determine the homography matrix H. This is again Ax = 0, which we met in the fitting chapter: x is the column of V for the smallest singular value. This is the first appearance of singular value decomposition. Having recovered H, we could directly solve for [r1 r2 t]; however, because H was estimated by SVD, the recovered r1 and r2 need not be orthogonal or of equal length. We therefore fit the rotation again, with objective min ||R − R'|| over rotations R, where R' = [K⁻¹H(:,1:2), r1×r2]. The method is again singular value decomposition: R = UVᵀ. This is the second appearance of singular value decomposition.
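The whole recipe — scale recovery, r3 reconstruction, and the final SVD re-orthogonalization — fits in one short function. The round-trip below uses a hypothetical K and plane pose so the recovery can be checked exactly:

```python
import numpy as np

# Pose-from-homography sketch following the recipe above: recover s*[r1 r2 t]
# = K^-1 H, rebuild r3 = r1 x r2, then re-orthogonalize with a final SVD
# (R = U V^T), since the DLT-estimated columns need not be exactly orthonormal.
def pose_from_homography(K, H):
    M = np.linalg.solve(K, H)
    s = np.linalg.norm(M[:, 0])              # scale from normalizing K^-1 h1
    M = M / s
    r1, r2, t = M[:, 0], M[:, 1], M[:, 2]
    R_approx = np.column_stack([r1, r2, np.cross(r1, r2)])
    U, _, Vt = np.linalg.svd(R_approx)       # nearest rotation matrix
    R = U @ Vt
    if np.linalg.det(R) < 0:
        R = -R
    return R, t

# Round trip on a hypothetical plane pose: H = K [r1 r2 t] for the plane z = 0.
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
th = 0.3
R_true = np.array([[np.cos(th), -np.sin(th), 0],
                   [np.sin(th),  np.cos(th), 0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 2.0])
H = K @ np.column_stack([R_true[:, 0], R_true[:, 1], t_true])
R_est, t_est = pose_from_homography(K, H)
```

With an exact H the recovery is exact; with a noisy DLT estimate, the final SVD projection is what keeps R a valid rotation.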

2. Pose estimation using projective transformation

Pose estimation using homography matrices presupposes that all points lie on a single plane. However, pose estimation using projective transformations abandons this premise; therefore, the previous section is a special case of this section. This problem is formally known as the PnP problem: perspective-n-point.

Following the same line of thought, we can still write it in the following form:


The projective matrix here has 12 unknowns: 9 from the rotation matrix and 3 from the translation vector. Each point provides two equations, so 6 scene points suffice to obtain P by singular value decomposition. As before, we compute [R|t] = K⁻¹P and then correct R by SVD. Note, however, that the problem really has only 6 degrees of freedom (3 translations, 3 rotations), so using 6 points is something of a brute-force shortcut.

3. Pose estimation using two point cloud images.

This situation is likely quite common with the currently popular RGBD cameras. Given 3D images of the same object from different angles, how do we determine the transformation relationship between the two poses? An analytical solution to this problem requires a one-to-one correspondence between the points. If the points cannot be matched, then it becomes a problem requiring the ICP algorithm.

This problem is formally known as the Procrustes Problem, originating from Greek mythology. A more apt analogy is the "shoe-putting problem." It involves rotating and translating a foot to fit it into a shoe. Mathematically, it can be described as follows: by choosing appropriate values ​​for R and T, the difference between A and B can be minimized.


T is actually quite easy to guess; if two point clusters can coincide, then their centroids must also coincide. Therefore, T represents the vector between the centroids of the two point clusters. This problem can then be transformed as follows:



Matrix analysis shows that the 2-norm of a vector has the following transformations:


Matrix analysis shows that the last two terms are actually equal (due to the cycle invariance of the trace and the transpose invariance). Therefore, the optimization objective can be transformed into:


The trace is a quantity related to singular values ​​(the trace remains unchanged under similarity transformations).



Clearly, if the trace of Z is as large as possible, then there is only one possibility: Z is the identity matrix, and the trace of the identity matrix is ​​the largest among the rotation matrices. Therefore, the analytical solution to R is as follows:
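A minimal numpy sketch of this closed-form solution (the Kabsch algorithm), verified on a hypothetical point cloud under a known rotation and translation:

```python
import numpy as np

# Procrustes sketch (Kabsch): align centroids to get T, then take R = V U^T
# from the SVD of the cross-covariance of the centered clouds -- the analytical
# solution derived above via trace maximization.
def procrustes(A, B):                        # find R, T with B ~ R A + T
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    H = (A - ca).T @ (B - cb)                # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # avoid a reflection solution
        Vt[2] *= -1
        R = Vt.T @ U.T
    T = cb - R @ ca
    return R, T

rng = np.random.default_rng(0)
A = rng.normal(size=(12, 3))                 # hypothetical source point cloud
th = 0.7
R_true = np.array([[np.cos(th), -np.sin(th), 0],
                   [np.sin(th),  np.cos(th), 0],
                   [0.0, 0.0, 1.0]])
T_true = np.array([1.0, -2.0, 0.5])
B = A @ R_true.T + T_true                    # target cloud (matched point order)
R_est, T_est = procrustes(A, B)
```

This assumes the point correspondences are known; when they are not, ICP alternates between matching nearest points and re-running exactly this closed-form step.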


Thus, we have obtained the analytical solution for 3D pose estimation!

(VI)

The last topic is Bundle Adjustment, the most advanced method in robot vision.

1. Camera pose estimation based on nonlinear optimization

In the earlier post on fitting, we covered the nonlinear least squares problem. Bundle Adjustment, also known as bundle block adjustment, uses nonlinear least squares to determine camera poses and 3D point coordinates simultaneously. It aims to reconstruct the surroundings with high accuracy, given only the camera's internal matrix. The optimization objective of Bundle Adjustment remains minimizing the reprojection error.


Similar to solving triangulation by nonlinear least squares, all parameters in bundle adjustment — R, C, and X — are variables. With N images there are N poses plus all the point coordinates, resulting in a very large Jacobian matrix. Essentially, we use the Jacobian matrix for a gradient-descent search; see the earlier "Fitting" section for details.

2. Jacobian matrix

The rows of the Jacobian matrix carry the measurement information and the columns the variables being constrained: each row is the error of one point at one pose, and each column is the partial derivative of f with respect to one component of x.




q, x, and c are all variables, where q is the rotation quaternion, x is the 3D point coordinate, and c is the coordinate of the camera's optical center in the world frame. J can be divided into three parts: the first four columns are the derivatives with respect to the rotation, the middle three with respect to c, and the last three with respect to x. The rotation derivative can be further decomposed, via the chain rule, through the rotation matrix's dependence on the quaternion q. Once we have the expression for J, we can run Gauss-Newton iterations to optimize. The differentiated expressions are as follows:





If there are two cameras, the total Jacobian matrix is ​​as follows:


By iterating over all q, C, and X values simultaneously, we eventually obtain the world point coordinates and camera poses at the same time — SLAM!
