For more than 40 years after Moore's Law was formulated, semiconductor manufacturing processes improved at a dazzling pace, with Intel microprocessors eventually exceeding 4 GHz in clock speed. While higher clock speeds improved program performance to some extent, they brought growing problems: power consumption and heat dissipation became bottlenecks in chip design, and chip costs rose accordingly. When simply raising the clock speed could no longer deliver higher performance, dual-core and then multi-core processors became the only practical way forward. After AMD and Intel abandoned the pure frequency race, both companies began to release dual-core, quad-core, and even octa-core processors. As engineers turned to developing applications for these multi-core processors, they found that by applying parallel programming techniques they could achieve optimal performance and maximum throughput, greatly improving the efficiency of their applications.
However, industry experts also recognize that parallel programming on multi-core processors presents a significant challenge for practical applications. Bill Gates articulated this as follows:
"To fully utilize the power of processors working in parallel, ... software must be able to handle concurrency issues. But as any developer who has written multithreaded code will tell you, this is one of the most challenging tasks in the field of programming."
For example, to write a multithreaded program in C++, the programmer must not only know C++ well and understand how to divide the program into threads and schedule tasks among them, but must also understand the Windows multithreading mechanism and be familiar with the Windows API and the MFC framework. Many programmers consider debugging multithreaded C++ programs a nightmare.
Therefore, for engineers in the test and measurement industry, achieving performance gains on multi-core hardware in a traditional development environment means a large amount of complex multithreaded programming, which distracts them from the actual tasks of automated testing and signal processing. As a result, fully exploiting the architecture and parallel computing power of today's multi-core machines has seemed an "impossible" task for many engineers.
Fortunately, the NI LabVIEW graphical development platform provides an ideal programming environment for multi-core processors: it reduces the complexity of parallel programming and enables rapid development of signal processing applications with a parallel architecture. Because LabVIEW is an inherently parallel programming language, it can automatically map parallel branches of a program onto multiple threads and distribute them across the available processor cores, improving efficiency and achieving optimal performance for computation-intensive mathematical or signal processing applications.
Let's take multi-channel signal processing and analysis, a common application in automated testing, as an example. Since frequency analysis in a multi-channel system is a resource-intensive operation, distributing each channel's signal processing task across multiple processor cores in parallel is crucial for improving execution speed. From a LabVIEW programmer's perspective, gaining this previously "impossible" advantage requires only minor adjustments to the algorithm structure, not a complex and time-consuming code rewrite.
Taking dual-channel sampling as an example, we need to perform Fast Fourier Transform (FFT) on the data from the two channels of the high-speed digitizer. Assume that both channels of our high-speed digitizer acquire signals at a sampling rate of 100 MS/s and analyze them in real time. First, let's look at the traditional sequential programming model for this operation in LabVIEW.
Figure 1. LabVIEW code executed sequentially
As in text-based programming languages, the traditional way to handle multi-channel signals is to read the signal from each channel and analyze the channels one at a time. The sequential programming model in LabVIEW illustrates this well: the data from channels 0 and 1 are read in sequence, combined into one array, and then analyzed and output by an FFT function. Although this sequential structure runs fine on a multi-core machine, it does not distribute the CPU load effectively: even on a dual-core machine, the FFT executes on only one core while the other sits idle.
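To make the sequential model concrete for readers coming from text-based languages, here is a rough Python/NumPy sketch of the same idea; the function names and block size are hypothetical stand-ins, not part of the actual LabVIEW code:

```python
import numpy as np

SAMPLES = 1 << 16  # hypothetical size of one acquired data block

def acquire(channel, n=SAMPLES):
    """Stand-in for reading one block from a digitizer channel."""
    rng = np.random.default_rng(channel)
    return rng.standard_normal(n)

def sequential_fft(blocks):
    """Analyze every channel one after the other on a single thread."""
    return [np.abs(np.fft.rfft(b)) for b in blocks]

# Channels 0 and 1 are read and transformed strictly in sequence,
# so only one core is ever busy.
spectra = sequential_fft([acquire(0), acquire(1)])
```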
In reality, the FFT operations for the two channels are independent of each other. If the program could automatically distribute the two FFTs across the two CPUs of a dual-core machine, its efficiency could in theory be doubled. This is exactly what the LabVIEW graphical programming platform does: we can genuinely improve algorithm performance by processing the two channels in parallel. Figure 2 shows the LabVIEW code with a parallel structure; from the graphical programming point of view, the only change is adding one parallel FFT function.
Figure 2. LabVIEW code executed in parallel
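A text-based analogue of the parallel structure might look like the following sketch (hypothetical Python/NumPy code, not the LabVIEW implementation; NumPy's FFT releases the GIL, so two worker threads can genuinely occupy two cores):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

SAMPLES = 1 << 16  # hypothetical size of one acquired data block

def acquire(channel, n=SAMPLES):
    """Stand-in for reading one block from a digitizer channel."""
    rng = np.random.default_rng(channel)
    return rng.standard_normal(n)

blocks = [acquire(0), acquire(1)]

# The two FFTs are independent, so a two-worker pool lets them run
# side by side on a dual-core machine.
with ThreadPoolExecutor(max_workers=2) as pool:
    spectra = list(pool.map(np.fft.rfft, blocks))
```

The result is identical to the sequential version; only the scheduling changes, which mirrors the "add one parallel FFT function" change in the LabVIEW diagram.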
Because signal processing operations take longer as the amount of data grows, parallelizing the original signal processing program through these simple modifications improves performance and reduces total execution time in engineering applications.
Figure 3. For data blocks larger than 1M samples (100 Hz precision bandwidth), the parallel approach achieves 80% or higher performance gains.
Figure 3 shows the percentage performance gain as the size of the acquired data block (in samples) increases. For larger data blocks, the parallel algorithm approaches a 2x performance improvement. Engineers do not need to write any special multithreading code: with minimal programming adjustments in a multi-core environment, LabVIEW automatically assigns each thread to a core of the multi-core processor, yielding a significant boost in signal processing capability and improving the performance of automated test applications.
Further optimization of program performance
LabVIEW's parallel signal processing algorithms not only help engineers improve program performance, but also allow for a clearer division of the different uses of multiple processor cores in a project. For example, modules for controlling input sampling, displaying output, and signal analysis can be separated.
Take HIL (Hardware-in-the-loop) or in-circuit signal processing applications as an example. First, a high-speed digitizer or high-speed digital I/O module acquires the signal, and digital signal processing algorithms are executed in software. Then, the results are generated through another modular instrument. Common HIL applications include in-circuit digital signal processing (such as filtering and interpolation), sensor simulation, and custom component simulation, among others.
Generally, HIL can be implemented with two basic programming structures: a single-loop structure and a pipelined multi-loop structure with queues. The single-loop structure is simple to implement and has low latency for small data blocks, but its stages run strictly in sequence, which prevents concurrency: because the loop can only execute one function at a time, instrument I/O cannot proceed while data is being processed. The single-loop structure therefore cannot exploit the advantages of a multi-core CPU. In contrast, the multi-loop structure makes much better use of multi-core processors and supports far higher throughput.
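The single-loop structure can be sketched in Python (the stubs `read_instrument` and `process` are hypothetical stand-ins for the instrument I/O and analysis stages):

```python
import numpy as np

BLOCKS = 4            # hypothetical number of blocks per run
rng = np.random.default_rng(0)

def read_instrument():
    """Stand-in for acquiring one block from a digitizer."""
    return rng.standard_normal(1024)

def process(block):
    """Stand-in analysis stage (simple moving-average filter)."""
    return np.convolve(block, np.ones(8) / 8, mode="same")

outputs = []
for _ in range(BLOCKS):          # single loop: stages run back to back
    block = read_instrument()    # I/O sits idle while processing runs,
    result = process(block)      # and processing sits idle during I/O
    outputs.append(result)       # stand-in for writing to the generator
```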
For a multi-loop HIL application, data transfer can be achieved through three independent while loops and two queue structures. In this case, the first loop collects data from the instrument, the second loop performs signal processing and analysis, and the third loop writes the data to another instrument. This processing method is also known as pipelined signal processing.
Figure 4. Pipeline signal processing with multiple loops and queues.
In Figure 4, the top loop is a producer loop that collects data from a high-speed digitizer and passes it to the first FIFO queue structure. The middle loop acts as both a producer and a consumer. In each iteration, it receives (consumes) several datasets from the queue structure and independently performs 7th-order low-pass filtering on the contents of four different data blocks in a pipelined manner. Simultaneously, the middle loop also acts as a producer, passing the processed data to the second queue structure. Finally, the bottom loop writes the processed data to the high-speed digital I/O module. Thus, in a multi-core system, LabVIEW can automatically allocate the independently running loops in the above program structure to different processors. Furthermore, it can allocate the signal processing tasks of the four data blocks in the middle loop to different processors based on CPU usage, achieving performance improvements in a multi-core processor environment.
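The three-loop pipeline can be sketched with threads and queues (a hypothetical Python stand-in for the LabVIEW diagram; a simple moving-average filter replaces the 7th-order low-pass filter for brevity):

```python
import queue
import threading
import numpy as np

N_BLOCKS = 8                            # hypothetical number of blocks
q1, q2 = queue.Queue(), queue.Queue()   # the two FIFO queue structures
results = []
rng = np.random.default_rng(0)

def producer():
    """Top loop: acquire blocks and pass them downstream."""
    for _ in range(N_BLOCKS):
        q1.put(rng.standard_normal(1024))
    q1.put(None)                        # sentinel: acquisition finished

def worker():
    """Middle loop: consume raw blocks, filter them, produce results."""
    while (block := q1.get()) is not None:
        q2.put(np.convolve(block, np.ones(8) / 8, mode="same"))
    q2.put(None)

def consumer():
    """Bottom loop: write processed blocks to the output instrument."""
    while (block := q2.get()) is not None:
        results.append(block)

threads = [threading.Thread(target=f) for f in (producer, worker, consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the three stages run in separate threads connected only by queues, the OS, like LabVIEW's scheduler, is free to place them on different cores, so acquisition, filtering, and output overlap in time.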
Parallel processing algorithms improve processor utilization in multi-core CPUs. In fact, total throughput depends on two factors: processor utilization and bus transfer speed. Typically, the CPU and data bus are most efficient when processing large data blocks. Furthermore, we can further reduce data transfer time by using PXI (PCI) Express instruments with even faster transfer speeds.
Leveraging NI's parallel computing capabilities and high-speed PCI Express data transfer combined with Intel's multi-core technology, and running on an eight-core Dell PowerEdge 2950 server, NI simultaneously sampled and processed data from 128 channels at a rate of 10 kHz (2.56 MB/s). This made it possible to accomplish the seemingly impossible task of converting the data from 88 magnetometers on the outer wall of a Tokamak device into the solution of a system of partial differential equations on a 64 x 128 grid, completing the entire computation within a mere 1 ms!
As Dr. Louis Giannone, the development lead of the German project, stated, "By using LabVIEW programming for parallel application control, we increased the speed by 5 times on an 8-core machine, enabling us to successfully meet the 1 ms closed-loop control rate requirement!"