Research on Real-time Audio Transmission in Ethernet Environment
With the rapid development of network technology, VoIP has been widely applied. In local area network (LAN) environments especially, VoIP, with its ease of use and low cost, has become one of the main methods of instant communication. In practice, latency is a key factor affecting VoIP voice quality. ITU-T Recommendation G.114 recommends a one-way latency of no more than 150 ms for high-quality voice. Generally, if the latency is between 150 and 400 ms, the interactivity of the call degrades but remains acceptable; when the latency exceeds 400 ms, interactive communication becomes very difficult. Ensuring real-time audio transmission is therefore one of the primary problems to be solved in VoIP technology. This paper first introduces the principles and basic implementation process of VoIP, then conducts an experimental study of real-time audio transmission in an Ethernet environment, analyzes the impact of buffer settings and audio API calls on audio latency, and proposes countermeasures against Ethernet audio latency based on the analysis results.

1. VoIP Principles and PC-Based Implementation Process

The basic principle of VoIP is as follows: the sending end compresses the acquired raw voice data with a voice compression algorithm, packages the compressed voice data according to the TCP/IP standard, and sends the data packets to the receiving end over the IP network; the receiving end reassembles the packetized voice data, decompresses it, and restores the original voice signal, thereby transmitting voice over the network. Figure 1 shows the VoIP implementation process on a PC platform. As the figure shows, a basic PC-based VoIP application consists of three parts: a sending module, a receiving module, and network transmission. The sending module comprises audio acquisition, audio encoding, and packetized voice encapsulation.
The receiving module is generally the reverse of the sending module, mainly comprising packetized voice reception, audio decoding, and audio playback.

Figure 1: VoIP Implementation Process Based on a PC Platform

The functions of each part and their conventional implementations are described below.

The audio acquisition and playback module performs audio signal acquisition and playback, converting between analog and digital voice. It is implemented primarily through audio API functions; in the Windows operating system, common audio APIs include WaveX, DirectSound, and ASIO.

The audio encoding and decoding module compresses and decompresses the voice data. At the sending end, the raw voice data is voluminous and must be compressed and encoded into a specific audio format; likewise, at the receiving end, the received voice data must be decompressed and restored. In the Windows operating system, the ACM (Audio Compression Manager) manages all audio codecs (CODECs) in the system and is responsible for compressing and decompressing voice data. A CODEC is a small piece of code that compresses and decompresses a data stream; it may ship with the operating system itself or be installed by an application.

The packet voice encapsulation and reception module adds the appropriate header to the compressed voice data to form a voice packet, which is then handed to the transmission module. The TCP/IP protocol suite contains two distinct transport-layer protocols: the connection-oriented Transmission Control Protocol (TCP) and the connectionless User Datagram Protocol (UDP).
The difference lies in the service they provide: UDP offers a connectionless service, requiring no connection establishment before data transmission, and the remote host need not acknowledge received UDP data; TCP offers a connection-oriented service, requiring a connection to be established before transmission and released afterwards. Audio applications generally use UDP: although UDP provides no error retransmission, it preserves the real-time nature of the audio data.

The network transmission module sends the encapsulated IP voice packets from the sender to the receiver. In Windows operating systems, this is accomplished primarily through the Winsock functions.

2. The Relationship Between Buffer Size and Latency

Buffer size is closely related to latency. In general, a larger buffer yields higher latency but allows effective reordering of out-of-order packets, giving better voice quality; a smaller buffer yields lower latency, but because it cannot effectively absorb latency jitter, voice quality suffers. The buffer size should therefore be set to minimize latency while maintaining good voice quality.

The experimental program is a PC-to-PC VoIP program we wrote previously in VC++. It uses the low-level audio API (the WaveX functions) for audio acquisition and playback, ACM for voice compression and decompression, and Winsock for network communication. The program implements the basic functions of network voice transmission. The acquisition and playback buffers are the same size, two of each, operated as a ping-pong (double-buffer) scheme.

We measured the relationship between buffer size and end-to-end latency in an Ethernet environment. The end-to-end latency measurement method is as follows: run the program, feed a stimulus into the microphone input, and observe the output at the headset.
The difference between the two times is the end-to-end latency. The test is run as a loopback on a single machine, so synchronization between sender and receiver need not be considered. Furthermore, since the test environment is a 100 Mbit/s Ethernet link, the link transmission latency is in the microsecond range and negligible; the local loopback results therefore characterize the end-to-end latency well. The specific measurement method is to feed an appropriate test signal into the input to simulate voice, then observe the input and output on an oscilloscope to obtain the latency. The codec used in the test program was GSM610, with an 11.025 kHz sampling frequency, 8-bit mono mode, and the WaveX audio API. The measurement results are shown in Table 1.

Table 1: Relationship between Buffer Size and Delay

Buffer Size (bytes)              512     768     1024    1536    2048    4096
Speech Duration (ms)             46      70      93      140     196     392
Measured End-to-End Delay (ms)   ~350    ~400    ~500    ~600    ~700    ~800

In this test environment each sample is quantized to one byte and the sampling frequency is 11.025 kHz, so one second of raw speech is 11025 bytes. The speech duration is the buffer size divided by 11025 bytes/s, and this duration is also the buffer (packing) delay. In the experiment we found that with a 512-byte buffer, although the buffer delay is small, pauses in the voice are very noticeable and sound quality is poor. Setting the buffer to 768 bytes improves the sound quality significantly while adding little packing delay, so the later experiments use a 768-byte buffer. As Table 1 shows, increasing the buffer increases the delay markedly.
However, even when the buffer is quite small (512 bytes), the delay is not reduced much further and remains at around 350 ms, while the corresponding voice duration is only 46 ms. Clearly, factors other than buffer packing and transmission introduce substantial delay along the VoIP path. Section 3 analyzes the composition of the end-to-end delay in detail.

3. Composition of Delay in an Ethernet Environment

Delay in VoIP arises at every stage of the IP telephony path, as shown in Figure 2, and can be roughly divided into four parts:

(1) Audio acquisition and playback delay, caused by the audio API.
(2) Buffer delay: the delay introduced by filling (packing) the buffer at the sender and unpacking it at the receiver. As the experiment in Section 2 shows, buffer delay is related to the size of the buffer.
(3) Voice encoding/decoding delay, caused by the voice coding algorithm. Its value varies with the algorithm, but not greatly; empirical values lie between 5 and 40 ms.
(4) Network transmission delay: the time required for the data to reach its destination across the network.

Figure 2: VoIP delay distribution

Since Ethernet offers large bandwidth over short distances, the network delay is generally below 1 ms and can be ignored. The VoIP delay in a LAN environment is therefore composed mainly of voice encoding/decoding delay, packing/buffering delay, and audio acquisition and playback delay. To further determine the delay distribution among these parts under Ethernet conditions, we inserted timestamps into the experimental program using the QueryPerformanceCounter function, which provides high-resolution timing.
We conducted a loopback test on the local machine using GSM610 encoding/decoding, an 11.025 kHz sampling frequency, 8-bit mono mode, a 768-byte buffer, and WaveX as the audio API, and measured the latency of the audio acquisition, compression, decompression, and playback stages of the program. The raw audio data in the experiment was the size of one buffer. The experimental results are shown in Table 2.

Table 2: Latency Composition of Each Part of the WaveX Program

Audio Acquisition Latency   Compression Latency   Decompression Latency   Audio Playback Latency
~180 ms                     ~5 ms                 ~5 ms                   ~200 ms

Adding the latencies of the parts gives an end-to-end latency of approximately 390 ms, which is essentially consistent with the experimental results of Section 2 and indicates that our measurements are reliable. The results show that the main components of the latency are the audio acquisition and audio playback latencies. Even after deducting the buffer latency (the voice duration of a 768-byte buffer, about 70 ms) from each of these, roughly 200 ms or more remains, which should be caused by the low-level audio API, WaveX.

4. Analysis of Ethernet Latency Reduction Strategies

Given the results of Section 3, reducing the latency requires a higher-performance audio API. We modified the program to use DirectSound instead of WaveX for audio acquisition and playback. WaveX lacks hardware acceleration, has high CPU utilization, and incurs significant latency. DirectSound, the audio component of the DirectX API, provides fast mixing, hardware acceleration, and direct access to the underlying devices. DirectSound supports waveform capture and playback and can obtain additional services by controlling the hardware and its drivers. Compared with WaveX, DirectSound is newer and more powerful, supports mixing and hardware acceleration, and exhibits lower acquisition and playback latency.
The DirectSound implementation steps are briefly as follows. The sound acquisition process is shown in Figure 3: the DirectSoundCaptureEnumerate function enumerates all recording devices in the system, DirectSoundCaptureCreate creates a capture device object, and CreateCaptureBuffer creates a capture buffer object; SetNotificationPositions then sets notification positions so that data can be copied out of the capture buffer periodically. The sound playback process is shown in Figure 4: DirectSoundCreate and CreateSoundBuffer perform the corresponding initialization, the Lock function locks a region of the buffer, the audio data is written into the locked region, and Unlock releases it after writing.

Figure 3: Sound Acquisition Flowchart. Figure 4: Sound Playback Flowchart.

In the same experimental environment as Section 3, we measured the latency of the DirectSound program. The end-to-end latency measured with the oscilloscope was approximately 250 ms. The timestamp measurement results are shown in Table 3.

Table 3: Latency Composition of Each Part of the DirectSound Program

Audio Acquisition Latency   Compression Latency   Decompression Latency   Audio Playback Latency
~120 ms                     ~5 ms                 ~5 ms                   ~130 ms

The results show that the latency of the program using DirectSound is significantly lower than that of the program using WaveX. In addition, ASIO (Audio Stream Input/Output) can be used. ASIO exploits the processing power of the sound-card hardware and greatly reduces the system latency of the audio stream; ASIO audio acquisition latency can be shortened to a few milliseconds. However, it requires the support of a professional sound card, is complex to use, and is difficult to implement.

5. Conclusion

This paper analyzes the end-to-end latency of VoIP applications in a LAN environment and verifies through experiments that the audio transmission latency in an Ethernet environment consists mainly of buffer latency and audio API call latency, with the API call latency being the largest component. When developing Ethernet VoIP application systems, it is therefore crucial to optimize the implementation of these two parts to improve voice quality.