Introduction
Virtual view synthesis is an important research topic in computer vision: it generates images or videos of viewpoints that were never captured, starting from existing ones. It is widely applied in 2D-to-3D video conversion, free-viewpoint television, virtual reality, and 3D video encoding/decoding. Depth-image-based rendering (DIBR) is the most commonly used method for virtual view synthesis. This technique maps the original-viewpoint video onto a virtual viewpoint using existing 2D images/videos and their corresponding depth maps, producing virtual images/videos.
As early as 2002, the ATTEST project proposed separating 3D video into a 2D video stream and a depth-map stream for transmission, with the two or more video streams synthesized via DIBR at the user end during playback. The Fraunhofer Heinrich Hertz Institute (HHI) in Germany refined the DIBR algorithm to obtain higher stereoscopic video quality. Moscow State University in Russia has researched virtual view synthesis for over a decade, and its company YUVsoft offers mature products for 2D-to-3D conversion, stereo-to-multiview video conversion, and stereoscopic video enhancement. The Tanimoto Laboratory at Nagoya University in Japan, in addition to participating in the development of the View Synthesis Reference Software (VSRS), has provided effective depth-map estimation algorithms and publicly released its multi-view video dataset, giving academia and industry a reference baseline.
1. The principle of hole formation and the difficulties in filling holes
In depth-map-based virtual view synthesis, the pixels of the original view are back-projected to world coordinates using their depth values and the original camera parameters; the world coordinates are then projected onto the image plane of the virtual view using the virtual camera parameters. This process is called 3D warping.
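The two-step mapping above can be sketched in a few lines of code. The following is a minimal forward-warping sketch, assuming metric depth values, 3x3 intrinsic matrices, and a rigid transform (R, t) from the reference camera to the virtual camera; the function name and parameters are illustrative. A z-buffer resolves collisions, and pixels that receive no mapping stay at zero, which is exactly where the holes appear:

```python
import numpy as np

def warp_to_virtual_view(image, depth, K_ref, K_virt, R, t):
    """Forward-warp a reference view into a virtual view (3D warping).

    depth holds metric depth per pixel; K_ref/K_virt are 3x3 intrinsics;
    (R, t) is the rigid transform from the reference to the virtual camera.
    Destination pixels with no source mapping remain 0 -- the holes.
    """
    h, w = depth.shape
    warped = np.zeros_like(image)
    z_buffer = np.full((h, w), np.inf)  # keep the nearest surface on collisions

    # Pixel grid in homogeneous coordinates (u, v, 1)
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])

    # Back-project to 3D with the reference camera, move to the virtual camera
    pts = np.linalg.inv(K_ref) @ pix * depth.ravel()
    pts = K_virt @ (R @ pts + t.reshape(3, 1))

    z = pts[2]
    valid = z > 0
    u2 = np.round(pts[0, valid] / z[valid]).astype(int)
    v2 = np.round(pts[1, valid] / z[valid]).astype(int)
    src_u, src_v, zv = u.ravel()[valid], v.ravel()[valid], z[valid]

    inside = (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)
    for su, sv, du, dv, dz in zip(src_u[inside], src_v[inside],
                                  u2[inside], v2[inside], zv[inside]):
        if dz < z_buffer[dv, du]:          # z-test: nearer pixel wins
            z_buffer[dv, du] = dz
            warped[dv, du] = image[sv, su]
    return warped
```

With a sideways-shifted virtual camera, the zero-valued pixels this leaves behind correspond to the disocclusion and boundary holes discussed below.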
During synthesis, some background in the original viewpoint is occluded by the foreground. In the new virtual viewpoint this occluded background is exposed, and since its content is unknown, hole areas appear. Additionally, because the original camera's field of view is limited, some boundary areas of the new viewpoint have no corresponding region in the original viewpoint, so after 3D warping significant holes also appear at some boundaries of the virtual viewpoint. Eliminating these hole areas is a crucial step in virtual view synthesis.
Figure 1 shows an example of hole formation in a real-world scene. In the "Ballet" sequence, the 2D image and corresponding depth map of the original viewpoint are mapped to the virtual viewpoint position by 3D warping. Because the foreground (the ballet dancer and the man) is closer to the camera, it partially occludes the background; under the virtual viewpoint this occluded background is exposed, forming hole areas (white regions). Additionally, because the virtual viewpoint is positioned further to the left, a larger hole appears at the left boundary of the virtual view.
The hole problem is one of the most difficult problems in virtual view synthesis, for the following reasons:
(1) Hole areas are relatively large.
The size of a hole is determined by the offset between the virtual view and the original view: the greater the distance between them, the larger the hole area, and vice versa. Typically, even when the distance is small, the horizontal width of a disocclusion hole still exceeds 10 pixels. Because these holes are relatively large, they cannot be filled by simple linear interpolation.
(2) The true content of a hole is difficult to obtain.
For a single frame, the content inside a hole is unknown and can only be predicted from the pixel information surrounding it. The prediction may be inaccurate, especially for large holes, where the lack of information makes faithful reconstruction extremely difficult. Even where the true values cannot be recovered, the restored content must still look "reasonable", which is itself a key challenge.
(3) Interference from foreground objects.
By the principle of hole formation, holes should be filled with background content, yet effectively distinguishing foreground from background during filling is inherently difficult. If foreground objects are not excluded, the filled holes often contain artifacts copied from the foreground. Many methods take measures to limit this, but artifacts remain in certain areas, such as foreground edges.
(4) The virtual view must maintain continuity.
The human eye is very sensitive to jumps between video frames, and too many jumps cause viewer discomfort. The content filled into hole regions must therefore remain consistent from frame to frame, avoiding inconsistency and flickering; maintaining this inter-frame consistency is a key challenge.
2. Current Status of Research on Hole-Filling Methods
Hole-filling methods for virtual views can be divided into two categories. The first preprocesses the depth map to reduce the generation of disoccluded areas. The second does not preprocess the depth map, instead using the temporal or spatial correlation of the video to obtain information for filling the disoccluded areas.
2.1 Preprocessing Depth Map Method
Holes in the virtual view are mainly caused by abrupt changes in the depth map, especially at the boundary between the foreground and background, where the depth values change drastically, leading to holes. Depth map preprocessing methods use low-pass filters to remove these abrupt changes, making the depth map variations smoother and reducing holes in the virtual view.
Depth map preprocessing can employ symmetric or asymmetric Gaussian low-pass filtering. Symmetric Gaussian low-pass filtering produces severe distortion, such as magnifying foreground objects, known as the rubber sheet effect. To address this, asymmetric Gaussian low-pass filtering can smooth the depth map to varying degrees in both the horizontal and vertical directions, with a greater smoothing effect in the horizontal direction than the vertical. Due to its anisotropic nature, asymmetric Gaussian low-pass filtering reduces the generation of holes to some extent. While both symmetric and asymmetric Gaussian low-pass filtering can smooth edge regions in the horizontal direction, they also smooth non-hole regions, leading to a decrease in image quality in non-hole areas. To overcome this problem, edge-dependent Gaussian filters smooth edges only in the horizontal direction, while adaptive edge-oriented smoothing filters do not smooth non-hole regions during depth map preprocessing. Depth map preprocessing methods are only suitable for situations with small baselines between cameras and are difficult to apply to situations with large baselines, meaning they cannot fill in large holes.
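As an illustration of the asymmetric variant described above, the sketch below smooths a depth map with a larger horizontal sigma than vertical sigma using a separable Gaussian; the function names and default sigma values are illustrative choices, not taken from any of the cited methods:

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """1D Gaussian kernel, normalized to sum to 1."""
    if radius is None:
        radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def smooth_depth_asymmetric(depth, sigma_h=10.0, sigma_v=3.0):
    """Asymmetric Gaussian low-pass filtering of a depth map.

    The horizontal sigma is larger than the vertical one because holes
    open up horizontally when the virtual camera shifts sideways; stronger
    horizontal smoothing softens depth edges and shrinks those holes.
    """
    kh, kv = gaussian_kernel(sigma_h), gaussian_kernel(sigma_v)
    out = depth.astype(float)
    # Separable filtering: rows first (horizontal), then columns (vertical)
    out = np.apply_along_axis(lambda r: np.convolve(r, kh, mode='same'), 1, out)
    out = np.apply_along_axis(lambda c: np.convolve(c, kv, mode='same'), 0, out)
    return out
```

Applied at a foreground/background depth step, the filter replaces the abrupt transition with a gradual ramp, which is precisely why disocclusion holes shrink and why depth fidelity (and hence the 3D effect) is lost.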
The depth-map smoothing approach greatly weakens the 3D effect of the generated virtual view because much of the depth information is filtered out. The purpose of synthesizing a 3D view is to give a strong sense of depth layering, and smoothing the depth map works against that purpose. Distorting the depth map also distorts the generated virtual view: in particular, foreground objects are deformed and vertical textures become misaligned.
2.2 Non-preprocessed depth map method
Another class of approaches does not preprocess the depth map but exploits the temporal or spatial correlation of the video to obtain filling information. Based on the type of correlation used, these methods fall into three categories: spatial domain-based methods, temporal domain-based methods, and spatiotemporal domain-based methods.
(1) Spatial domain-based methods
These methods exploit the spatial correlation within a frame, filling holes from the surrounding background information. In the spatial domain, view-blending methods can fill most hole areas using information from multiple viewpoints, but they require multiple camera acquisition devices and additional transmission bandwidth, raising costs; single-viewpoint approaches have therefore attracted wider attention. Layered view synthesis gradually fills holes in the virtual view through downsampling and upsampling, producing views without geometric distortion, but blurring occurs when the hole area is large. Currently popular hole-filling methods that avoid blurring are based on image inpainting. Criminisi et al. proposed a scheme combining image inpainting and texture synthesis: it first computes a priority for each hole-boundary pixel, then searches the known region for the best-matching patch and copies it into the highest-priority location. Directly applying image inpainting can fill large holes effectively, but the filled area contains many foreground artifacts. To mitigate this, many inpainting-based enhancements use depth information to exclude the foreground from the filling process. Daribo and Saito incorporated depth into the patch-priority and patch-distance computations of the Criminisi algorithm, assigning higher priority to patches with lower variance and selecting the best-matching patch from regions of similar depth and color. Gautier et al. also extended the Criminisi algorithm, defining the data term with the structure tensor of the Di Zenzo matrix and likewise incorporating depth into the optimal-patch model. These methods assume that the depth map of the virtual view is given, which is unrealistic in practice. The methods of Ahn and Kim, Köppel et al., and Buyssens et al. also use depth maps to improve the filling priority and patch selection, but they do not require the virtual view's depth map to be provided; instead, they inpaint the virtual view's depth map while repairing the holes in the virtual view itself.
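The priority-driven filling order shared by these Criminisi-style methods can be sketched as follows. The depth weight below is a simplified stand-in for the depth terms of Daribo and Saito or Gautier et al., not their exact formulas; it merely illustrates how patches dominated by background (larger depth) can be favored over foreground ones:

```python
import numpy as np

def patch_priority(confidence, mask, depth, p, half=4, lam=5.0):
    """Priority of a hole-boundary pixel p for exemplar-based filling.

    Follows the Criminisi-style rule priority = confidence * data, with
    the data term replaced here by a simplified depth weight favoring
    background (larger-depth) patches -- an illustrative stand-in, not
    the exact term of the cited methods.
    mask is True inside the hole; confidence is 1 for known pixels.
    """
    r, c = p
    win = (slice(max(r - half, 0), r + half + 1),
           slice(max(c - half, 0), c + half + 1))
    known = ~mask[win]
    # Confidence term: fraction of already-known pixels in the patch
    C = confidence[win][known].sum() / mask[win].size
    # Depth weight: prefer patches whose known pixels lie deep in the scene
    D = 1.0 / (1.0 + lam / (1e-6 + depth[win][known].mean()))
    return C * D
```

In a full filling loop, the boundary pixel with the highest priority is filled first with the best-matching patch from a similar-depth region, priorities are updated, and the process repeats until the hole is closed.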
(2) Temporal domain-based methods
In the temporal domain, areas occluded by the foreground in the current frame may become visible in other frames due to foreground motion, so background modeling can be used to recover the background of occluded areas. The mean background model first segments the background from the scene and then updates it dynamically to form a stable background; this method is only suitable for scenes with quasi-static backgrounds. Temporal background models use depth-value variance to search the reference video forward and backward for unoccluded background information, and median-filter this information to form a reference background; however, suitable background information is limited to adjacent time periods. One work first predicts the depth values of the disoccluded regions, then uses the depth values as a threshold to segment foreground from background and to update the background depth map and background video. Another uses a Gaussian mixture model (GMM) for background modeling to generate the background video and adds a foreground depth correlation (FDC) correction to remove moving and stationary foregrounds; however, when the depth map is inaccurate, FDC introduces foreground artifacts into the background. To reduce computation and improve scene adaptability, another work proposes an online switchable Gaussian model. A further background-modeling framework first extracts the foreground and then models the background, employing motion estimation; it avoids foreground texture artifacts and is applicable to moving backgrounds.
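A minimal version of the temporal-median idea, with depth used to reject likely-foreground samples, might look like the sketch below; the threshold-based foreground test and the larger-depth-means-farther convention are simplifying assumptions, not the exact models of the cited works:

```python
import numpy as np

def temporal_background(frames, depths, depth_thresh):
    """Per-pixel temporal background model via median filtering.

    For each pixel, only samples whose depth exceeds depth_thresh
    (i.e. likely background) contribute; the median over time then
    suppresses transient foreground. Pixels never observed as
    background remain NaN.
    frames: (T, H, W) intensity; depths: (T, H, W), larger = farther.
    """
    stack = frames.astype(float).copy()
    stack[depths <= depth_thresh] = np.nan   # drop foreground samples
    return np.nanmedian(stack, axis=0)       # temporal median per pixel
```

This captures why such models need the occluded background to be revealed at some point in time: a pixel covered by a static foreground for the whole sequence contributes no background sample and stays unfilled.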
(3) Spatiotemporal domain-based methods
Wexler et al.'s video inpainting method uses global optimization to ensure the filled content is maximally continuous in both the temporal and spatial domains, enabling large holes in video sequences to be filled. However, because it seeks optimal matches between spatiotemporal patches, its computational cost grows with video length and image size; even with the coarse-to-fine optimization described in the paper, the runtime remains too high for practical use. The PatchMatch method greatly reduces the computational complexity of patch matching while achieving near-optimal results, and Newson et al. extended PatchMatch to the spatiotemporal domain, further improving video inpainting quality and speed. Huang et al. proposed a video inpainting method suited to moving backgrounds. Directly using video inpainting to fill holes introduces many foreground artifacts. Choi et al. exploited the correlation between the temporal and spatial domains, adding a temporal dimension to the Criminisi algorithm, which originally used only spatial correlation; by using both intra-frame and inter-frame information to fill holes, they achieved greater continuity between frames and reduced flickering. Hsu et al. and Kim et al. have since proposed spatiotemporal hole-filling methods based on energy-function optimization, which likewise ensure the spatiotemporal continuity of the repaired video.
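A common primitive behind these spatiotemporal methods is a patch distance computed over a 3D (time x height x width) volume rather than a 2D window; minimizing it over candidate patches keeps the fill coherent across frames as well as within them. A minimal sketch, assuming the patches lie fully inside the video volumes:

```python
import numpy as np

def st_patch_distance(video_a, pa, video_b, pb, half=(1, 3, 3)):
    """Sum-of-squared-differences between two spatiotemporal patches.

    pa and pb are (t, row, col) patch centers; half gives the patch
    half-extent in each dimension. This is the 3D analogue of the 2D
    patch distance used by exemplar-based image inpainting.
    """
    ht, hr, hc = half
    t, r, c = pa
    A = video_a[t - ht:t + ht + 1, r - hr:r + hr + 1, c - hc:c + hc + 1]
    t, r, c = pb
    B = video_b[t - ht:t + ht + 1, r - hr:r + hr + 1, c - hc:c + hc + 1]
    return float(((A - B) ** 2).sum())
```

Searching for the patch minimizing this distance over all frames is what makes the naive approach expensive, and is exactly the search that PatchMatch-style propagation accelerates.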
3. Review of various methods
Hole filling is a key problem in virtual view synthesis. An ideal hole-filling method should: 1) reflect the true scene content; 2) look natural and plausible; 3) maintain temporal continuity; 4) run fast.
For methods based on depth-map preprocessing, smoothing the depth map weakens the 3D effect of the generated view, and such methods cannot handle scenes with large hole areas. Although these algorithms are simple and fast, the virtual view is severely distorted.
Image inpainting methods can recover content from unknown areas and fill large holes without causing blurring. However, directly applying image inpainting methods to fill non-occluded areas may result in foreground content being sampled to fill the hole, leading to a "foreground penetration" phenomenon. Some inpainting-based enhancement methods utilize the characteristics of non-occluded areas to better fill holes, but these improvements require specific inpainting techniques and may not be applicable to other inpainting methods, lacking general applicability. Furthermore, while the inpainted content may appear similar to the surrounding content, it may not necessarily be the actual content.
Background modeling methods can recover part of the background, but not the portions occluded by a static foreground throughout the video, nor boundary regions that do not appear in the original view. Moreover, traditional background modeling brings some foreground texture into the constructed background, or is unsuitable for scenes with camera movement. The background-modeling framework discussed above avoids foreground texture artifacts, but it does not consider inter-frame continuity and exhibits flickering.
Spatiotemporal inpainting methods use global optimization to select the best spatiotemporal patch for each fill, but every selection requires traversing all frames of the video, resulting in high computational complexity.
In conclusion, no existing method simultaneously satisfies the requirements of reflecting the true content, maintaining continuity, and running fast.