Background Model Construction Method Based on Image Patch
Abstract: This paper proposes a background model construction method based on image patch segmentation, aiming to reduce the computational redundancy of pixel-based background models and improve the system's running speed. The paper reviews the main background extraction methods, presents an image patch segmentation scheme and several commonly used patch features, and uses these features to construct an adaptive Gaussian mixture model. An experimental comparison of this method with the traditional pixel-based background model on a set of videos shows that the method significantly improves running efficiency while maintaining the same target detection rate.

Keywords: Video surveillance; Background model; Moving target; Gaussian distribution

1 Introduction

Video surveillance has been widely applied in many areas of production practice and social life, and in many situations it cannot be replaced by human observers. The main task of intelligent video surveillance is to analyze the video sequence of the monitored scene automatically and in real time, discover moving targets in the scene, identify and track them, and provide timely information feedback and alarms for abnormal events. Detecting dynamic targets from a video sequence is therefore the first and most fundamental task of video surveillance. Currently, many tracking systems rely on background extraction techniques for moving target detection: the current input frame is compared with a reference background model, and each pixel is classified as target or background according to the deviation of its value from the model. The pixels classified as targets are then further processed to identify the target and determine its location, so that tracking can be achieved. Background extraction techniques are widely used in tracking systems such as video surveillance.
Many methods for constructing background models have been proposed. A simple background model can be a single image without moving objects, while a complex one is a continuously updated statistical model. However, the real world is complex and ever-changing, including swaying trees, ripples on water, flickering displays, and changing illumination. To handle these complexities, background models have become increasingly elaborate, which challenges the real-time processing requirements of the system. Practical video surveillance demands that a background model not only handle complex situations well but also remain feasible for real-time computation. Most current background models are built pixel by pixel, treating each pixel as an independent random variable that is individually classified as background or foreground (target). However, a single pixel alone sometimes reveals little, as with noise points. In fact, in regions without targets the structure of the image itself is relatively stable, so analyzing only single pixels during target extraction generates a great deal of redundant information and inevitably lowers the actual execution efficiency of the algorithm. Processing multiple adjacent pixels as a whole is one way to reduce this redundancy and improve efficiency. This paper therefore proposes a method for constructing a multi-pixel background model based on image patches.

2 Previous Related Work

The background model is the core of background extraction methods, and many models have been proposed for different applications. Current research focuses on the robustness of the background model to environmental changes and its sensitivity to moving targets. Static background models have generally been abandoned in favor of first-order filters that update the current background model in real time.
The basic background extraction method first computes the difference between the current image and the background image in the three RGB channels, then classifies pixels as foreground or background with a threshold. This is an early method, reflecting the original idea of background-model-based moving target extraction. The W4 method, proposed by Haritaoglu et al., assigns three values to each pixel over a training period: the minimum value Min, the maximum value Max, and the maximum difference D between two consecutive frames. A pixel I(x) in the current frame is classified as foreground if

|Min(x) − I(x)| > D(x)  or  |Max(x) − I(x)| > D(x),

and as background otherwise. Wren et al. proposed the single Gaussian model. It represents pixels in YUV form, with each pixel value modelled by a Gaussian density with mean μ and covariance Σ. A threshold is chosen, and the Mahalanobis distance is used to classify each pixel: if the distance between X and the mean μ is below the threshold, X is classified as background; otherwise it is classified as foreground (i.e., target). The adaptive form of the single Gaussian model uses a simple first-order filter to continuously update the model parameters:

μ_t = (1 − α)μ_{t−1} + αX_t,

where α is a positive constant less than 1. However, this model does not perform satisfactorily in dynamic (multimodal) natural environments, such as scenes with swaying trees or ripples on water. To handle such situations effectively, Friedman and Russell proposed a Gaussian mixture background model for traffic monitoring systems in 1997. This model treats the dynamic scene as a set of discrete processes, each following a Gaussian distribution, so that each pixel can be characterized by a multi-component Gaussian mixture distribution.
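The single Gaussian test and its adaptive first-order update can be sketched as follows. This is an illustrative Python/NumPy sketch for grayscale frames, not code from any of the cited systems; the function name and the default learning rate and threshold are assumptions:

```python
import numpy as np

def update_single_gaussian(mean, var, frame, alpha=0.05, k=2.5):
    """One step of an adaptive single-Gaussian background model.

    mean, var : per-pixel running mean and variance of the background
    frame     : current grayscale frame (same shape)
    alpha     : learning rate (the positive constant < 1 in the text)
    k         : classification threshold in standard deviations

    Returns (foreground_mask, new_mean, new_var).
    """
    # Mahalanobis-style test per pixel: far from the mean -> foreground.
    foreground = np.abs(frame - mean) > k * np.sqrt(var)
    # First-order (exponential) filter update of the model parameters.
    new_mean = (1 - alpha) * mean + alpha * frame
    new_var = (1 - alpha) * var + alpha * (frame - new_mean) ** 2
    return foreground, new_mean, new_var
```

In practice the update would usually be applied only to pixels classified as background, so that foreground objects do not pollute the model.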
The Gaussian mixture model can describe a multimodal background environment and is currently one of the more successful models for complex backgrounds. Many papers have discussed it. Reference [4] describes a Gaussian mixture model algorithm with update capability, which uses an online K-means approximation in place of the exact EM (Expectation-Maximization) algorithm. Subsequently, many authors proposed improvements and extensions, including a new mixture update algorithm that not only updates the parameters but also optimizes the number of mixture components. Reference [7] proposes a Bayesian formulation based on the Gaussian mixture model and applies it to background segmentation, which is decomposed into two independent problems: estimating the Gaussian mixture distribution of each pixel, and evaluating the probability that each component of the mixture belongs to the background. The Gaussian mixture background model regards the values of a pixel over a period of time as a "pixel process", i.e., the time series of pixel values, such as gray values or the vector values of a color image. For a specific pixel, its history up to time t is

{X_1, …, X_t},

where the X_i come from the image sequence, and each pixel is abstracted into a Gaussian mixture with K components. The probability of observing the current pixel value is then

P(X_t) = Σ_{i=1}^{K} w_{i,t} · η(X_t, μ_{i,t}, Σ_{i,t}),   (1)

where K is the number of distributions in the mixture, w_{i,t} is the weight estimate of the i-th Gaussian at time t, and μ_{i,t} and Σ_{i,t} are the mean and covariance matrix of the i-th Gaussian at time t. Heikkila and Pietikainen proposed a texture-based background model in [8], which assigns each pixel a model consisting of a group of adaptive local binary pattern histograms computed over a circular region around the pixel.
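Equation (1) can be written directly in code for a scalar (gray-value) pixel process. This is an illustrative NumPy sketch of the mixture density only, not of any update algorithm, and the function name is our own:

```python
import numpy as np

def gmm_pixel_probability(x, weights, means, variances):
    """Evaluate Eq. (1) for a scalar pixel value x:
    P(x) = sum_i w_i * eta(x; mu_i, sigma_i^2)
    under a K-component Gaussian mixture with the given parameters.
    """
    w = np.asarray(weights, dtype=float)
    mu = np.asarray(means, dtype=float)
    var = np.asarray(variances, dtype=float)
    # Gaussian densities of x under each component.
    eta = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return float(np.sum(w * eta))
```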
3 Gaussian Mixture Background Model Based on Image Patches

Most of the background models above are built at the level of single pixels. Although the single-pixel model has the advantages of accuracy and flexibility, it suffers from low execution efficiency, because it processes each pixel in isolation. In fact, in areas without moving targets there is a certain structure among adjacent pixels, and this structure is usually quite stable. For example, when the external lighting changes, the pixels in the background generally become brighter or darker together. Statistics show that background with a stable structure accounts for the vast majority of the image, while moving targets occupy only a small proportion. Extracting moving targets with a single-pixel background model therefore generates a great deal of redundant computation. To address this problem, we divide the image into patches and use features of the patches to build the background model. The following sections present this patch-based background modeling method. To simplify the discussion, we assume the image is grayscale. Let K_m(x, y) denote an m×m image patch whose top-left corner is at coordinates (x, y).

3.1 Selection of Image Patch Features

An image patch is defined as a square block of adjacent pixels. Taking a 2×2 patch as an example, there are five possible distributions of foreground pixels within a patch; Figure 1 shows these ideal cases. Generally speaking, the larger the patch, the fewer patches need to be processed and the higher the efficiency.
However, sensitivity to small targets decreases, and the accuracy of the detected target deteriorates, because the number of patches with a small foreground proportion increases, as in cases (b) and (c) of Figure 1. As the patch size grows, the selection of patch features also becomes more complex. Considering the sensitivity and accuracy of target detection as well as processing efficiency, m = 2 is generally the most appropriate choice. There are many ways to define features. The simplest is to select an arbitrary pixel of the patch as its representative, or to use the patch mean as the feature. Here we propose several schemes for extracting features:

(1) the center pixel of the patch;
(2) a combination of several pixels as representatives of the patch, for example the pixels on its diagonal;
(3) the mean of the patch;
(4) the row means or column means of the patch;
(5) the amplitude value of the patch;
(6) the row or column amplitude values of the patch.

Since these features are linear combinations of the pixels in the patch, when the pixel values follow a normal distribution, any feature A obtained by a linear operation on the original pixels also follows a normal distribution. If A = vᵀX, where X is a vector formed from several pixels and v is a vector of the same dimension, then

A ~ N(vᵀμ, vᵀΣv),   (3)

where μ and Σ are the mean and covariance of X. Changes in the pixel values of a patch are reflected in changes of the feature values, so a background model can be established for the feature A (instead of pixel values) to determine whether the patch (instead of a pixel) is background or target.
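The patch partition K_m(x, y) and the linear features above can be made concrete with a short sketch (illustrative Python/NumPy, not the paper's Matlab code; the function names and feature vectors below are our own):

```python
import numpy as np

def split_into_patches(image, m=2):
    """Split an H×W grayscale image into non-overlapping m×m patches.

    Returns an array of shape (H//m, W//m, m, m); patch (i, j) covers the
    pixels whose top-left corner is (i*m, j*m), i.e. K_m(i*m, j*m) in the
    text. The image is cropped if H or W is not a multiple of m.
    """
    h, w = image.shape
    h, w = h - h % m, w - w % m
    img = image[:h, :w]
    return img.reshape(h // m, m, w // m, m).swapaxes(1, 2)

def patch_feature(patch, v):
    """Linear feature A = v . x of a patch flattened to a vector x.

    Per Eq. (3), if x is Gaussian then A is Gaussian as well, with mean
    v.mu and variance v^T Sigma v, so A can itself be modelled.
    """
    return float(np.dot(v, np.asarray(patch, dtype=float).ravel()))

# Some of the linear feature choices from the list above, for a 2×2
# patch (4 pixels, flattened row by row):
v_mean = np.full(4, 0.25)                # (3) patch mean
v_diag = np.array([0.5, 0.0, 0.0, 0.5])  # (2) mean of the diagonal pixels
v_row0 = np.array([0.5, 0.5, 0.0, 0.0])  # (4) first-row mean
```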
For example, the W4 method can be applied to the feature A: find its maximum value, minimum value, and maximum inter-frame difference over a training period, and use the corresponding inequality to decide whether a patch is background or foreground. Since the feature also follows a Gaussian distribution, a single Gaussian background model can likewise be built for A. The following develops a patch-based Gaussian mixture background model for several features selected from the patch.

3.2 Adaptive Gaussian Mixture Model Based on Image Patch Features

From the features given in Section 3.1, select one or several to form a feature vector A. We adopt the Gaussian mixture form and parameter update method of reference [4]. Over a chosen time period, the Gaussian mixture density with K components is

P(A_t) = Σ_{i=1}^{K} w_{i,t} · η(A_t, μ_{i,t}, Σ_{i,t}),   (4)

Σ_{i=1}^{K} w_{i,t} = 1,   (5)

where w_{i,t} is the proportion of the i-th component in the overall distribution, also called the weight of the i-th component. Based on the assumption that pixels are independent of each other, the features are also independent of each other. To simplify computation, we further assume the components of the feature vector have the same variance, so the covariance matrix can be simplified to Σ_{i,t} = σ_{i,t}²I, where I is the identity matrix. This assumption avoids the additional error introduced by complex computation. Relation (4) describes the feature vector of each image patch. The current feature vector A_t is checked against each component in turn; if it lies within k standard deviations of a component's mean, the match is considered successful. Experiments show that for a 2×2 patch, k = 4 is appropriate. If the current feature vector matches none of the distributions, the distribution with the minimum probability is replaced by a new distribution, which takes the current patch's feature vector as its mean and is assigned a large initial variance and a small weight.
The large initial variance is denoted σ₀². After the matching step is completed, the next task is to update the model with the current observation. First, at the current time t, the priority weights of the K distributions are adjusted as

w_{i,t} = (1 − α)w_{i,t−1} + αM_{i,t},   (6)

where α is the learning rate and M_{i,t} is 1 for the matched component and 0 for the others. This modification can be explained as follows: when the i-th distribution is successfully matched, its weight is smoothly increased, while the weights of the other Gaussian distributions decay exponentially. The modified priority weights are then normalized. The next step is to modify the distribution parameters. The parameters of unmatched components are left unchanged; only the component matching the new observation is updated, as

μ_t = (1 − ρ)μ_{t−1} + ρA_t,   (7)

σ_t² = (1 − ρ)σ_{t−1}² + ρ(A_t − μ_t)ᵀ(A_t − μ_t),   (8)

where ρ is called the update rate of the parameters. The background should be more stable than the foreground, and background patches appear far more often than target patches. In the Gaussian mixture, background components therefore have a greater chance of being matched in the image sequence and acquire larger priority weights. In addition, when a new target appears, it either generates a new component or increases the variance of an existing one. The Gaussian components are sorted by the ratio w/σ, so that components with larger priority weight and smaller variance come first; components earlier in this order are more likely to represent the background than later ones. The priority weights are summed from front to back, and the first B distributions whose cumulative weight exceeds T are defined as the background, where T is a measure of the minimum portion of the data that should be accounted for by the background.
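The matching, replacement, and update steps above (Eqs. 6-8) can be sketched for a scalar patch feature as follows. This is an illustrative Python/NumPy sketch, not the paper's Matlab implementation: the function name and default parameter values are our own, and for simplicity the update rate ρ is taken equal to the learning rate α.

```python
import numpy as np

def gmm_patch_step(A, w, mu, var, alpha=0.002, k=4.0, T=0.98, var0=30.0):
    """One update step of the patch-feature mixture model (Eqs. 6-8).

    A          : scalar feature value of the current patch
    w, mu, var : length-K arrays (weights, means, variances); mu and var
                 may be modified in place
    alpha      : learning rate; k: match threshold in standard deviations
    T          : minimum portion of the data accounted for by background
    var0       : large initial variance sigma0^2 for a replaced component

    Returns (is_background, w, mu, var).
    """
    K = len(w)
    sigma = np.sqrt(var)
    matched = np.abs(A - mu) < k * sigma
    if matched.any():
        i = int(np.argmax(matched))           # first matching component
        M = np.zeros(K)
        M[i] = 1.0
        w = (1 - alpha) * w + alpha * M       # Eq. (6)
        rho = alpha                           # simplified update rate
        mu[i] = (1 - rho) * mu[i] + rho * A   # Eq. (7)
        var[i] = (1 - rho) * var[i] + rho * (A - mu[i]) ** 2  # Eq. (8)
    else:
        # No match: replace the least probable component with a new one
        # centred on A, with large variance and small weight.
        i = int(np.argmin(w / sigma))
        mu[i], var[i], w[i] = A, var0, 0.01
    w = w / w.sum()                           # renormalize the weights
    # Rank components by w/sigma; the first B with cumulative weight > T
    # are declared background.
    order = np.argsort(-(w / np.sqrt(var)))
    csum = np.cumsum(w[order])
    B = int(np.searchsorted(csum, T)) + 1
    is_background = i in set(order[:B].tolist())
    return is_background, w, mu, var
```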
If the features of an image patch match one of the first B components, the patch is classified as background; otherwise it is foreground. One advantage of the mixture model is its ability to handle foreground interference well. When a moving target stops and becomes background, the algorithm does not destroy the existing background model; the original background distribution remains in the mixture until it becomes the smallest component and is replaced by a new one. Thus, when a target stops for a short time and then resumes movement, the original background is quickly recovered. Conversely, for a target that remains stationary, updates over a number of frames cause it to be absorbed into the background; the patch model improves this absorption speed.

4 Target Segmentation

Because of image noise and patch-level errors, denoising is generally performed before partitioning the foreground into connected sets. This step can be achieved with mathematical morphology, using erosion and dilation operators to remove isolated noise points, fill small holes in the target region, and merge adjacent connected regions. We can do this because most video surveillance applications assume that a moving target occupies more than a certain minimum area. After noise removal, connected component detection algorithms are used to find the connected sets; the commonly used connectivities are 4-connectivity and 8-connectivity. Connectivity is an equivalence relation on the foreground points, which are thereby divided into several equivalence classes (connected sets), i.e., the individual targets. Since image patches are coarser than pixels, the resulting foreground region carries a certain error. From the perspective of moving target detection and tracking, no further processing is needed, because the centroid of the detected target does not deviate much from the centroid of the true target.
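The 4-connected component detection described above can be sketched with a simple breadth-first search (illustrative pure-Python/NumPy code; the function name is our own, and a production system would typically use an optimized labeling routine instead):

```python
from collections import deque
import numpy as np

def label_4connected(mask):
    """Label 4-connected foreground components in a boolean mask via BFS.

    Returns (labels, n), where labels is an int array with 0 marking
    background and 1..n marking the connected sets (candidate targets).
    """
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=int)
    n = 0
    for y in range(h):
        for x in range(w):
            if mask[y, x] and labels[y, x] == 0:
                n += 1                       # start a new component
                labels[y, x] = n
                q = deque([(y, x)])
                while q:
                    cy, cx = q.popleft()
                    # Visit the four edge-adjacent neighbours.
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and labels[ny, nx] == 0):
                            labels[ny, nx] = n
                            q.append((ny, nx))
    return labels, n
```

Components smaller than the assumed minimum target size can then be discarded as noise.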
However, in some cases (such as target recognition), fine-grained processing of the target is required. The following methods can then be used to refine the foreground patch regions. The temporal difference method subtracts the patch region from the corresponding region in the previous frame; however, the edge on the side facing the direction of motion is extracted well, while the edge on the opposite side is extracted poorly. Another approach exploits the spatial continuity of the target: region growing can be used, computing the mean pixel value within a region, comparing the pixels at the region's edges with this mean, and removing a pixel from the foreground region if the difference exceeds a set threshold. Edge refinement based on color consistency can also be applied, but such processing should not extend beyond the size of one patch. Finally, using edge detection to extract the target's contour is another possible way to further process the patch regions.

5 Experiments

To analyze the effectiveness of the patch-based background model, we conducted two sets of experiments on a 320×240 video sequence of 420 frames. The video depicts a traffic intersection monitoring scenario containing swaying trees, pedestrians, vehicles, and a rotating billboard, making the scene relatively complex. The experiments compared the pixel-based and patch-based forms of the Gaussian mixture model.

5.1 Selection of Experimental Parameters

The experiments were programmed in Matlab 7.0. The Gaussian mixture background model contains four components, the patch size is 2×2, and two features were selected from each patch.
The parameters of the mixture model are: for the pixel-based form, background threshold T = 0.97, weight update rate α = 0.002, and a match threshold of 2.5 standard deviations; for the patch-based form, background threshold T = 0.98, weight update rate α = 0.002, and a match threshold of 4 standard deviations. Both algorithms are initialized as follows: the pixel values of the first frame are used as the initial mean of the first Gaussian component, the initial means of the other three components are 0, the initial variance of all four components is 4, and the initial weights of the four components are 0.4, 0.2, 0.2, and 0.2.

5.2 Analysis of Experimental Results

At the beginning of the video, the traffic lights at the intersection stop the vehicles travelling in the vertical direction of the image, and the rotating billboard in the right part of the image is detected as a moving target. Around frame 90, a white car appears on the right side of the image, followed immediately by a black car at the intersection. The experiment tracked the vehicles appearing from the right, extracting 10 frames from the video: frames 101, 131, 181, 201, 231, 271, 281, 331, 351, and 371. The processing results are shown in Figure 2, where the images are arranged in groups of three: the first row shows the original frames; the second row shows the detection results of the 2×2 patch-based Gaussian mixture background model; and the third row shows the results of the pixel-based Gaussian mixture background model. The results show that the patch-based model is comparable to the pixel-based model in detecting larger targets, and performs better on smaller targets, as can be seen from (a101), (b131), (h331), and (i351) in Figure 2. In terms of contour accuracy, the pixel-based model is superior to the patch-based model.
Experiments also show that the processing speed of the patch-based background model is significantly better than that of the pixel-based model. Figure 3 shows the ratio of the average processing speeds of the 2×2 patch-based and pixel-based Gaussian mixture models in the same computing environment; the patch-based model is on average more than three times faster.

6 Conclusion

This paper proposes a method for constructing a background model from image patches. The patch construction method and several commonly used patch features are given, a method of building an adaptive Gaussian mixture model on patch features is discussed, and the method is analyzed experimentally. The experiments show that although patch division affects the contour accuracy of detected moving targets, the detection rate of small moving targets improves, and under the same hardware and software environment the 2×2 patch-based model runs more than three times faster than the pixel-based background model.