An MPEG-2 compression encoder is a front-end device that compresses and encodes analog television audio and video signals using MPEG-2 and outputs a real-time transport stream (TS). It is suitable for digital television transmission and front-end source encoding, as well as applications such as video conferencing and distance education. An advanced encoder should have not only a DVB interface but also a telecommunications interface, so that the device can easily be used in HFC networks, microwave MMDS or 8 GHz systems, and SDH or PDH networks, as shown in Figure 1.
Figure 1 shows the structural block diagram of the encoder.
MPEG-2 video coding system and key technologies
The principle of MPEG-2 image compression exploits two characteristics of images: spatial correlation and temporal correlation. Any scene within a frame is composed of many pixels, so a pixel usually bears a definite relationship in brightness and chrominance to the pixels around it; this relationship is called spatial correlation. Similarly, a scene in a program is usually composed of a sequence of consecutive images, and there is likewise a definite relationship between consecutive frames in the sequence; this relationship is called temporal correlation. These two types of correlation mean that an image contains a large amount of redundant information. If this redundant information is removed and only the small amount of non-redundant information is transmitted, transmission bandwidth can be greatly reduced. The receiver, using this non-redundant information and the corresponding decoding algorithm, can recover the original image while maintaining a certain image quality. A good compression coding scheme is one that removes redundant information from the image to the greatest extent possible.
Encoded images in MPEG-2 are divided into three categories, namely I-frames, P-frames, and B-frames.
I-frame images use intra-frame coding, meaning they exploit only the spatial correlation within a single frame, not temporal correlation. I-frames are primarily used for receiver initialization, channel acquisition, and program switching and insertion. I-frame images have a relatively low compression ratio and appear periodically in the image sequence, at a frequency selectable by the encoder.
P-frames and B-frames employ inter-frame coding, exploiting both spatial and temporal correlation. P-frames use only forward temporal prediction, which improves compression efficiency and image quality. A P-frame can contain intra-frame coded portions: each macroblock in a P-frame may be either forward-predicted or intra-frame coded. B-frames use bidirectional temporal prediction, which raises the compression ratio significantly. Note that because B-frames use future frames as references, the transmission order of frames in the MPEG-2 encoded bitstream differs from their display order.
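The reordering forced by B-frames can be sketched in a few lines. The GOP layout and frame labels below are illustrative, not taken from a real bitstream; the rule shown is simply that each anchor frame (I or P) must be transmitted before the B-frames that precede it in display order, because those B-frames reference it.

```python
# Sketch: reordering MPEG-2 frames from display order to transmission order.
# B-frames need both their past and future reference frames decoded first,
# so each anchor (I or P) is sent before the B-frames that precede it in
# display order. Frame labels and GOP layout here are illustrative.

def display_to_transmission(display_order):
    """Reorder a GOP so every B-frame follows both of its reference frames."""
    transmission = []
    pending_b = []
    for frame in display_order:
        if frame[0] in ("I", "P"):          # anchor frame
            transmission.append(frame)      # send the anchor first...
            transmission.extend(pending_b)  # ...then the B-frames that used it
            pending_b = []
        else:                               # B-frame: hold until next anchor
            pending_b.append(frame)
    transmission.extend(pending_b)
    return transmission

gop = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]
print(display_to_transmission(gop))
# ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```

The decoder performs the inverse reordering before display, which is why B-frames add end-to-end latency.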
The MPEG-2 encoded bitstream is divided into six layers. To better represent the encoded data, MPEG-2 uses syntax to define a hierarchical structure consisting of six layers, from top to bottom: Sequence, Group of Pictures (GOP), Picture, Slice, Macroblock, and Block.

The main applications of the MPEG-2 standard are as follows: storage of audio and video data; non-linear editing systems and networks; microwave, satellite, and fiber-optic transmission; and broadcasting of television programs. In all-digital television technology, two crucial encoding technologies are source coding and channel coding; it is the source coding stage that employs MPEG-2 technology. Source coding primarily addresses the compression and storage of image signals, while channel coding primarily addresses their transmission. Image signals are large in volume; without compression, digital television signals could not be transmitted in real time. The main method of compression is to remove redundant signals, that is, the parts that carry no information or have little impact on perceived image quality. This is the principle behind MPEG-2 image compression.
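In the bitstream, the upper syntax layers are delimited by 32-bit start codes of the form 0x000001xx, with values defined in ISO/IEC 13818-2. A minimal scanner over a hand-made byte fragment (not a real bitstream) can illustrate how a decoder locates layer boundaries:

```python
# Sketch: locating MPEG-2 layer boundaries by their start codes.
# Start-code values are from ISO/IEC 13818-2; the byte fragment below
# is hand-made for illustration, not a real bitstream.

START_CODES = {
    0xB3: "sequence_header",
    0xB8: "group_of_pictures",
    0x00: "picture",
}

def scan_start_codes(data):
    """Return (offset, layer_name) for each 0x000001xx start code found."""
    hits = []
    i = 0
    while i <= len(data) - 4:
        if data[i:i + 3] == b"\x00\x00\x01":
            code = data[i + 3]
            if code in START_CODES:
                hits.append((i, START_CODES[code]))
            elif 0x01 <= code <= 0xAF:   # slice start codes
                hits.append((i, "slice"))
            i += 4
        else:
            i += 1
    return hits

fragment = (b"\x00\x00\x01\xb3"    # sequence header
            b"\x00\x00\x01\xb8"    # GOP header
            b"\x00\x00\x01\x00"    # picture header
            b"\x00\x00\x01\x01")   # first slice
print(scan_start_codes(fragment))
# [(0, 'sequence_header'), (4, 'group_of_pictures'), (8, 'picture'), (12, 'slice')]
```

The Macroblock and Block layers carry no start codes of their own; they are reached by parsing the variable-length-coded data inside each slice.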
(1) Spatial redundancy. An image consists of hundreds of thousands of pixels, and adjacent pixels are usually very similar (highly correlated). During transmission, much identical data would therefore be sent repeatedly; this is called spatial redundancy. By using a suitable encoding method (such as orthogonal transform coding), the spatially redundant information can be removed, reducing the transmission and recording bit rate.
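The effect of an orthogonal transform on correlated pixels can be demonstrated with a small DCT, the transform MPEG-2 itself applies to 8x8 blocks. The pixel values below are a made-up smooth gradient; the point is that a highly correlated row concentrates almost all of its energy in the first few coefficients, so the rest can be coarsely quantized or dropped.

```python
import math

# Sketch: why transform coding removes spatial redundancy. An 8-point
# DCT-II applied to a smooth (highly correlated) row of pixel values
# concentrates nearly all the energy in the first few coefficients.

def dct_1d(x):
    """Orthonormal 8-point DCT-II (the 1-D version of the MPEG-2 block transform)."""
    n = len(x)
    out = []
    for k in range(n):
        c = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        s = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        out.append(c * s)
    return out

row = [100, 102, 104, 106, 108, 110, 112, 114]   # smooth gradient: high correlation
coeffs = dct_1d(row)
energy = sum(c * c for c in coeffs)
low = sum(c * c for c in coeffs[:2])
print(f"{low / energy:.4%} of the energy lies in the first 2 of 8 coefficients")
```

For a flat or slowly varying block the fraction printed is above 99.9%; an uncorrelated (noisy) row would spread its energy across all eight coefficients and gain nothing from the transform.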
(2) Temporal redundancy. Television images also exhibit strong temporal correlation. For video at 25 frames per second, the difference between successive frames is usually very small, with most of the content unchanged, indicating a very high correlation between adjacent images. The correlation between images decreases only as they become farther apart in time. Moreover, the changes between these highly correlated images are generally regular, meaning the change in each image is predictable. By exploiting this temporal redundancy, redundant information in the image signal can be removed along the time axis, which likewise reduces the transmission and recording bit rate.
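The simplest form of temporal prediction is frame differencing, a minimal sketch of what inter-frame coding exploits (real MPEG-2 adds motion compensation on top). The pixel rows below are toy data: only a few values change between frames, so the residual is mostly zeros and compresses far better than the frame itself.

```python
# Sketch: exploiting temporal redundancy with simple frame differencing.
# Instead of sending each frame whole, only the difference from the
# previous frame is transmitted; for slowly changing video the residual
# is mostly zeros. The pixel rows are toy data.

def frame_difference(prev, cur):
    """Residual to transmit: current frame minus the previous frame."""
    return [c - p for p, c in zip(prev, cur)]

def reconstruct(prev, diff):
    """Decoder side: previous frame plus the received residual."""
    return [p + d for p, d in zip(prev, diff)]

frame1 = [50, 50, 52, 53, 90, 91, 50, 50]   # previous frame (one row)
frame2 = [50, 50, 52, 53, 92, 93, 50, 50]   # only the moving object changed

residual = frame_difference(frame1, frame2)
print(residual)   # [0, 0, 0, 0, 2, 2, 0, 0] -- mostly zeros
assert reconstruct(frame1, residual) == frame2
```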
(3) Statistical redundancy. After image and sound signals are digitized, they follow certain statistical laws. For example, in an image predictive coding system, the predicted value of the current pixel is derived from the values of a few adjacent pixels, or from the value of the corresponding pixel in the previous frame. Because of the spatial and temporal correlation of the image, small prediction errors occur with high probability and large prediction errors with low probability. Statistical coding therefore assigns short codes to the frequent small error values and long codes to the rare large error values, removing the statistical redundancy in the signal.
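The short-codes-for-frequent-symbols idea can be illustrated with Huffman coding, a classical variable-length code (MPEG-2 itself uses fixed VLC tables, but the principle is the same). The prediction-error counts below are made up for illustration.

```python
import heapq

# Sketch: removing statistical redundancy with a variable-length code.
# Small, frequent prediction-error values get short codewords; rare
# large errors get long ones. The symbol counts are illustrative.

def huffman_codes(freq):
    """Build a Huffman prefix code; returns {symbol: bitstring}."""
    # heap entries: (total frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, tick, merged))
        tick += 1
    return heap[0][2]

# prediction errors: 0 is very common, large errors are rare
errors = {0: 70, 1: 12, -1: 12, 4: 3, -4: 3}
codes = huffman_codes(errors)
avg = sum(errors[s] * len(codes[s]) for s in errors) / sum(errors.values())
print(f"average code length: {avg:.2f} bits vs 3 bits fixed-length")
# average code length: 1.54 bits vs 3 bits fixed-length
```

With a fixed-length code, five symbols would need 3 bits each; the skewed statistics bring the average well below that.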
(4) Perceptual redundancy. Human visual and auditory organs have limited sensitivity. Perceptual redundancy refers to components of the audio-visual signal that human vision and hearing cannot distinguish, or can distinguish only with difficulty: even with significant distortion in these components, people will not notice an appreciable reduction in image or sound quality, or may not perceive any change at all. Therefore, during encoding, different content can be coded with different precision. This is called selective encoding, and its aim is to reduce the bit rate.
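One concrete form of selective encoding is non-uniform quantization of transform coefficients: the eye is far less sensitive to fine high-frequency detail, so those coefficients can be divided by larger step sizes. The coefficient row and step sizes below are illustrative, not an actual MPEG-2 quantization matrix.

```python
# Sketch: exploiting perceptual redundancy with non-uniform quantization.
# High-frequency DCT coefficients are divided by larger step sizes,
# discarding detail the eye cannot see. Values are illustrative.

def quantize(coeffs, steps):
    """Divide each coefficient by its step size and round to an integer level."""
    return [round(c / q) for c, q in zip(coeffs, steps)]

def dequantize(levels, steps):
    """Decoder side: multiply levels back by the step sizes."""
    return [l * q for l, q in zip(levels, steps)]

dct_row = [310, 42, -18, 7, 3, -2, 1, 1]       # low -> high frequency
steps   = [8, 10, 14, 20, 28, 40, 56, 80]      # coarser at high frequencies

levels = quantize(dct_row, steps)
print(levels)                     # [39, 4, -1, 0, 0, 0, 0, 0]
approx = dequantize(levels, steps)
print(approx)                     # [312, 40, -14, 0, 0, 0, 0, 0]
```

The runs of zero levels are what makes the subsequent run-length and variable-length coding stages so effective; the low-frequency content the eye cares about is reconstructed almost exactly.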