Design of a Fast H.264 Encoding Algorithm Based on DSP Platform

2026-04-06 03:58:32
The H.264/AVC video compression standard was developed by the Joint Video Team (JVT), formed jointly by ISO/IEC and ITU-T. It introduced a series of advanced video coding techniques, such as a 4×4 integer transform, spatial-domain intra-frame prediction, and inter-frame prediction with multiple reference frames and variable block sizes. Since its release, the standard has been widely praised in industry for its high compression performance and network-friendly design, and its application scope expanded further after the JVT published the Fidelity Range Extensions in July 2004. However, its huge computational load has become a bottleneck to widespread adoption. Given the complexity of implementing the H.264 protocol, this paper takes two approaches: first, improving hardware processing speed and capability by using TI's Davinci TMS320DM6446 DSP chip as the hardware platform for the H.264 encoder; and second, improving algorithm efficiency. A design scheme for an embedded H.264 encoder based on this chip is then presented.

1 Hardware Platform

1.1 Introduction to the Davinci DM6446 Chip

The DM6446 adopts a dual-core DSP+ARM architecture (core diagram shown in Figure 1). The DSP core's clock frequency can reach 594 MHz. The ARM core takes over control functions, freeing the DSP to concentrate on data processing. The chip uses an enhanced Harvard-architecture bus. Its DSP CPU has two data paths, eight 32-bit functional units, and two general-purpose register files (A and B), and can execute eight 32-bit instructions simultaneously: when all eight functional units are kept busy, a 256-bit instruction packet is dispatched to the eight parallel units each cycle. Under full pipelining, the instruction throughput of this chip reaches 594 × 8 = 4752 MIPS.
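The dual 16-bit capability described next can be illustrated with a portable C sketch. This is an emulation, under the assumption that the packed operation behaves like TI's `_add2()` compiler intrinsic: two independent 16-bit adds performed in one 32-bit operation, with no carry propagating between the lanes.

```c
#include <stdint.h>

/* Portable emulation of a dual 16-bit add, the kind of packed
 * operation the C64x+ core completes in a single cycle (cf. the
 * _add2() intrinsic on TI DSPs). Each 32-bit word holds two
 * independent 16-bit lanes; overflow in one lane does not carry
 * into the other. */
static uint32_t add2_emulated(uint32_t a, uint32_t b)
{
    uint32_t lo = (uint16_t)((uint16_t)a + (uint16_t)b);      /* lower lane */
    uint32_t hi = (((a >> 16) + (b >> 16)) & 0xFFFFu) << 16;  /* upper lane */
    return hi | lo;
}
```

On the real hardware this costs one cycle per pair of 16-bit operands, which is why packed arithmetic matters for pixel-oriented loops.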
The processor has dual 16-bit extension capability: it can complete dual 16-bit multiply, add, subtract, compare, and shift operations in one cycle. The chip has two levels of on-chip cache: a 32 KB program cache (L1P) and an 80 KB data cache (L1D) at the first level, and a configurable 64 KB cache at the second level. The chip automatically maintains data consistency between the two cache levels, and this cache support significantly speeds up CPU execution. The Davinci DM6446 also features a dedicated video image processing subsystem, comprising a video front end and a video back end. The video front-end (VPFE) input interface receives image data such as BT.656 from external sensors or video decoders; the video back-end output interface drives local image display. The VPFE consists of a CCD controller (CCDC), a previewer, a resizer, an auto exposure/white balance/focus module (H3A), and registers. The CCD controller can connect to a video decoder, a CMOS sensor, or a charge-coupled device; the previewer is a real-time image processor.

1.2 H.264 Encoder Hardware Platform

The core processing chip of this system is the Davinci DM6446, as shown in Figure 2. Two DDR chips are connected in parallel to form a 32-bit data bus with 256 MB of space. The analog video signal introduced at "VIDEO IN" is converted to a digital signal by the TVP5146 decoder chip and then fed into the TMS320DM6446 for processing. The H.264 encoded bitstream can be output at the video end and saved to a local hard drive for easy debugging and inspection, or output through a 10/100M Ethernet physical-layer interface for network transmission. At the same time, the locally reconstructed image can be displayed directly after D/A conversion by the OSD and video encoder modules inside the TMS320DM6446.
2 H.264 Encoder Structure and Encoding Process

2.1 H.264 Encoder Structure

As shown in Figure 3, the input image enters the encoder in macroblock units, and intra-frame or inter-frame predictive coding is selected according to how fast the image changes. If intra-frame predictive coding is chosen, the encoder first determines whether the current block contains much detail, and thus whether to partition it further. Next, using blocks of the unfiltered reconstructed frame uF′n as a reference and taking the prediction modes of neighboring blocks into account, it selects the best prediction mode for the current block. Finally, the predicted value of the current block is obtained from the corresponding block of uF′n and the chosen prediction mode. Performing intra-frame prediction on every macroblock in this way yields the prediction P of the whole frame. If inter-frame predictive coding is chosen, the current input frame Fn and the previous (reference) frame Fn-1 are sent to the motion estimator (ME). Through block search and matching, the offset of each macroblock in the current frame relative to its matching block in the reference frame is obtained; these offsets are the motion vectors. The reference frame Fn-1 and the motion vectors MV are then sent to the motion compensator (MC) to compute the inter-frame prediction P. Subtracting the prediction P from the current frame Fn gives the residual Dn. After transform and quantization, a set of quantized transform coefficients X is produced. After entropy coding, these coefficients are combined with the side information needed for decoding (prediction modes, quantization parameters, motion vectors, etc.) to form the compressed bitstream, which is transmitted and stored through the NAL (Network Abstraction Layer).
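The residual path described above (Dn = Fn − P, then transform and quantization) can be sketched for one 4×4 block. This is a simplified illustration, not the bit-exact H.264 integer transform: a plain uniform quantizer stands in for the transform + quantization stage, and `qstep` is this sketch's own stand-in for the QP-derived step size.

```c
#include <stdint.h>

#define BLK 4  /* H.264 transforms operate on 4x4 blocks */

/* Form the residual Dn = Fn - P for one 4x4 block, then apply a
 * uniform round-to-nearest quantizer with step size qstep.
 * (Illustrative only: the real encoder applies the 4x4 integer
 * transform before quantization.) */
static void residual_and_quantize(const uint8_t cur[BLK][BLK],
                                  const uint8_t pred[BLK][BLK],
                                  int qstep, int16_t coeff[BLK][BLK])
{
    for (int y = 0; y < BLK; y++)
        for (int x = 0; x < BLK; x++) {
            int d = (int)cur[y][x] - (int)pred[y][x];   /* residual Dn */
            /* symmetric round-to-nearest quantization */
            coeff[y][x] = (int16_t)((d >= 0 ? d + qstep / 2
                                            : d - qstep / 2) / qstep);
        }
}
```

The decoder side would invert only the quantization (coeff × qstep), which is where the coding loss comes from.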
2.2 Encoder Encoding Flow

Figure 4 shows the main flow of the H.264 encoder. The input frame is first partitioned: macroblocks are the basic units, several macroblocks are combined into slices, and slices are combined into slice groups, so the slice and slice group to which each macroblock belongs are determined. The encoder then determines whether the input frame is an I-frame or a P-frame. Once this is done, each macroblock can be encoded. After every macroblock is encoded, the reconstructed image is interpolated to 1/4-pixel precision and inserted into the reference frame buffer. Only then is the encoding of one frame complete.

3 Motion Estimation Modes and Fast Rate-Distortion Decision

To reduce temporal redundancy in image sequences and achieve better compression, the H.264/AVC coding scheme employs motion compensation and prediction: a prediction of the current frame is generated from one or more previously encoded frames, and predictive coding is then performed. A variable-block-size motion prediction scheme is used, with the luma block size varying from 16×16 down to 4×4, giving many selectable modes in a tree-structured motion prediction mechanism. I-frames offer intra 4×4 and intra 16×16 modes; P-frames offer intra 4×4, intra 16×16, inter 16×16, inter 16×8, inter 8×16, inter 8×8, inter 8×4, and inter 4×8, plus a special SKIP mode provided for P-frames and B-frames, for a total of 11 modes. These selectable modes make the coding more flexible, and the coding accuracy is much higher than that of fixed-block-size prediction. However, the increase in selectable inter-frame prediction modes inevitably increases computational complexity.
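The tree-structured mode set above can be captured in a small lookup table. The enum names below are this sketch's own (not from any codec API); each mode is mapped to the luma partition size it covers.

```c
/* Illustrative enumeration of the selectable prediction modes
 * discussed above, mapped to the luma partition dimensions each
 * one covers. Names are this sketch's own invention. */
typedef enum {
    MODE_SKIP,        MODE_INTER_16x16, MODE_INTER_16x8,
    MODE_INTER_8x16,  MODE_INTER_8x8,   MODE_INTER_8x4,
    MODE_INTER_4x8,   MODE_INTER_4x4,
    MODE_INTRA_4x4,   MODE_INTRA_16x16, MODE_NUM
} pred_mode_t;

/* Return the partition width/height for a mode; SKIP and intra
 * 16x16 cover the whole macroblock. */
static void mode_partition(pred_mode_t m, int *w, int *h)
{
    static const int dims[][2] = {
        {16,16}, {16,16}, {16,8}, {8,16}, {8,8},
        {8,4},   {4,8},   {4,4},  {4,4},  {16,16}
    };
    *w = dims[m][0];
    *h = dims[m][1];
}
```

A table like this is what the mode-decision loop iterates over when evaluating the candidate partitions.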
Therefore, an efficient decision method is needed to select the block-size combination, achieving both good coding efficiency and good coding quality.

3.1 Lagrange Cost Function

The Lagrange cost function is introduced as follows:

J(si, m) = D(si, m) + λ·R(si, m)    (1)

where D represents the distortion between the reconstructed image and the original image; R(si, m) represents the number of bits occupied in the bitstream by the macroblock's encoded data and related parameters, generally obtained from encoding statistics (for SKIP mode the bit count defaults to 1 bit); and λ is the Lagrange multiplier used during mode selection. For motion estimation, the Lagrange cost function serves as the decision criterion for selecting motion vectors. According to equation (1), the cost function for the ME decision on a sample block si is:

mi = arg min over m∈M of [D(si, m) + λ·R(si, m)]    (2)

This equation returns the best-matching motion vector mi that produces the minimum cost, where M is the set of all possible coding modes, m is the currently selected mode, and R(si, m) in equation (2) is the number of bits needed to transmit (entropy-encode) the motion vector (mx, my). D(si, m) is the prediction error of an image macroblock, and there are two schemes for computing it: as the sum of absolute differences (SAD), shown in equation (3), or as the sum of squared differences (SSD), shown in equation (4):

SAD(si, m) = Σ(x,y)∈A |s(x, y) − s′(x − mx, y − my)|    (3)

SSD(si, m) = Σ(x,y)∈A [s(x, y) − s′(x − mx, y − my)]²    (4)

where A is the current coded macroblock, s is the current frame, and s′ is the reference frame. When multiple reference frames are used for motion estimation, mi also identifies the selected best reference frame. During the motion search, block si is first searched at integer-pixel precision, using the minimum of equation (1) as the matching criterion. After the best integer-pixel match is found, the same method is used to search at 1/2- and 1/4-pixel precision.
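Equations (1) and (3) translate directly into code. The sketch below assumes row-major 8-bit pixel buffers sharing a common stride, the rate term arriving as a precomputed bit count, and λ held in fixed point (scaled by 256) so the cost stays in integer arithmetic on the DSP; all of these representation choices are this sketch's assumptions.

```c
#include <stdint.h>

/* SAD distortion of equation (3) for one w x h block: sum of
 * absolute pixel differences between the current block and the
 * (already motion-displaced) reference block. */
static int sad_block(const uint8_t *cur, const uint8_t *ref,
                     int stride, int w, int h)
{
    int sad = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int d = cur[y * stride + x] - ref[y * stride + x];
            sad += d >= 0 ? d : -d;
        }
    return sad;
}

/* Lagrange cost J = D + lambda * R of equation (1), with lambda
 * pre-scaled by 256 (fixed point) to avoid floating point. */
static int rd_cost(int distortion, int rate_bits, int lambda_x256)
{
    return distortion + ((lambda_x256 * rate_bits) >> 8);
}
```

The motion search then simply keeps the vector whose `rd_cost` is smallest over the search window.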
The same operation is performed across the multiple reference frames, the resulting costs are compared, and the minimum gives the best-matching motion vector mi of block si.

3.2 Fast Prediction Mode Decision Algorithm

Compared with the full Lagrange cost function algorithm, the fast algorithm proceeds in two steps:

(1) Calculate the cost function J for each prediction mode, but with a simplified method: the pixels of each block are sampled in a staggered pattern, e.g. downsampling the pixels of an 8×8 block as shown in Figure 5, and the SAD is computed over the sample points only, denoted SADi. The Lagrange cost function over the sample points alone is J = SADi(si, m) + λ·R(si, m). First compute this cost J for each mode, then select the three modes with the smallest cost as the candidate mode set.

(2) For each mode in the candidate set from step (1), compute the rate-distortion cost according to equation (1), thereby achieving RDO-based mode selection: the mode with the smallest J value is taken as the final prediction mode.

4 Test Results and Conclusions

The H.264 encoder system designed on the DM6446 platform is now basically complete. We selected several common video sequences to test the encoder's performance; the test data are shown in Table 1. The data show that this H.264 encoder works correctly and exhibits good compression performance. This encoder, however, implements only the baseline profile of the H.264 protocol and has not yet undergone more specialized optimization; the remaining parts of the protocol, owing to their complexity, require further research. Along this direction, video compression performance can be further improved.
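As a closing illustration, step (1) of the fast algorithm in section 3.2 can be sketched in C: a downsampled SAD over staggered sample points, followed by keeping the three cheapest modes as the candidate set. The checkerboard sampling pattern and helper names are this sketch's assumptions (Figure 5's exact pattern may differ).

```c
#include <stdint.h>

#define NCAND 3  /* size of the candidate mode set */

/* SAD over a staggered (checkerboard) subsample of an 8x8 block:
 * half the pixels are visited, roughly halving the SAD cost. */
static int sad_staggered(const uint8_t *cur, const uint8_t *ref, int stride)
{
    int sad = 0;
    for (int y = 0; y < 8; y++)
        for (int x = (y & 1); x < 8; x += 2) {   /* alternate columns per row */
            int d = cur[y * stride + x] - ref[y * stride + x];
            sad += d >= 0 ? d : -d;
        }
    return sad;
}

/* Keep the indices of the NCAND modes with the smallest simplified
 * cost J; these form the candidate set for the full RDO pass. */
static void pick_candidates(const int cost[], int nmodes, int cand[NCAND])
{
    int best[NCAND] = { -1, -1, -1 };
    for (int m = 0; m < nmodes; m++)
        for (int k = 0; k < NCAND; k++)
            if (best[k] < 0 || cost[m] < cost[best[k]]) {
                for (int j = NCAND - 1; j > k; j--) best[j] = best[j - 1];
                best[k] = m;
                break;
            }
    for (int k = 0; k < NCAND; k++) cand[k] = best[k];
}
```

Only the three surviving modes then pay the full cost of the rate-distortion evaluation in step (2).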