Design of a Video Traffic Detection System Based on a Quad-Core DSP

A traffic information video detection system acquires traffic data through image analysis and is an important component of Intelligent Transportation Systems (ITS) [1-2]. Cameras mounted above the road serve as sensors and feed road traffic images to the detection system, which analyzes them in real time to extract vehicle traffic data (traffic flow, vehicle speed, vehicle density, and so on) and sends the results to the traffic information control center over a communication link. Such systems offer high accuracy, long service life, and easy maintenance. However, the large volume of traffic image data, constantly evolving processing algorithms, and varied practical requirements place ever higher demands on hardware performance. A single processor is increasingly insufficient, and general-purpose, high-performance parallel multiprocessor systems are attracting growing attention and use. This paper proposes a design for a traffic information video detection system based on a quad-core DSP parallel architecture. Four DSP processors process image data in parallel, which greatly improves the system's data processing capability and transmission performance.

1 Overall Scheme of the Video Detection System

Existing traffic information video detection systems are relatively complex, unstable, and expensive, and they lack real-time performance. They also require dedicated management personnel and are cumbersome to operate. The design presented here adopts a 4-core DSP structure, as shown in the system block diagram in Figure 1. Four digital signal processors (DSPs) are combined through the communication interfaces between the four system units, giving the system the advantages of a four-processor architecture. The system units implement the detection algorithm and exchange data with external devices.

During operation, the CCD camera captures the traffic flow image signal, which is converted from analog to digital to produce digital video data. The data is stored in a video buffer FIFO. Once a full line has been buffered, an interrupt request is sent to the 4×DSP system. The DSP services the interrupt, transfers the digital video data to SDRAM, completes the acquisition of the digital video image and the separation of the Y, U, and V components, and assembles a complete frame of digital image data. An interrupt is then raised to notify the algorithm-processing program, which processes the image and stores the result in a buffer predefined in the DSP address space, where external devices can retrieve the detection results for subsequent processing.

2 Introduction to DSP

Since its introduction in 1982, the DSP (digital signal processor) has developed rapidly. This design uses four high-end DSPs from TI (Texas Instruments), the TMS320C6416, which offers a high clock speed and dual external address and data buses, making it well suited to image processing and similar fields. Its main characteristics are listed below; detailed information can be found in reference [3].

(1) The DSP core adopts a very long instruction word (VLIW) architecture with 8 functional units and 64 32-bit general-purpose registers. It can execute 8 instructions in one clock cycle, giving a peak throughput of 4800 MIPS (million instructions per second), and it supports 8/16/32/64-bit data types. The two multiplier units together can perform four 16×16-bit or eight 8×8-bit multiplications in one clock cycle. Each functional unit has additional hardware functions that improve the orthogonality of the instruction set.
In addition, some instructions have been added to reduce code size and increase register flexibility.

(2) To keep the very fast DSP core supplied with data, the TMS320C6416 uses a two-level cache architecture: a 16 KB level-1 data cache, a 16 KB level-1 program cache, and 1024 KB of unified program and data memory. For greater flexibility, 256 KB of the 1024 KB memory can be configured as a level-2 cache.

(3) The memory interface of the TMS320C6416 provides a glueless (seamless) interface to SDRAM, SBSRAM, and asynchronous devices such as SRAM and ROM, and it can also connect to external I/O devices.

(4) The TMS320C6416 adds a PCI interface that supports a 32-bit multiplexed address and data bus with a maximum operating frequency of 33 MHz.

(5) Compared with the tens of watts drawn by general-purpose CPUs, DSP devices generally consume a few watts or even milliwatts. This is a distinct advantage in power-sensitive applications and removes the need for a complicated cooling system. The C6416 used here has an I/O voltage of 3.3 V and a core voltage of 1.2 V. At a clock frequency of 600 MHz, its maximum power consumption is less than 1.6 W.

2.1 Parallel Image Processing System of 4×DSP

A new parallel image processing system is built from four TI TMS320C6416 high-end digital signal processors. The processors are interconnected through a synchronous 4-port SRAM and a system bus, combining the advantages of tightly coupled and loosely coupled parallel systems [4].

2.2 Parallel System Structure of 4×DSP

Image processing algorithms are flexible and diverse, and they continue to develop rapidly. To accommodate increasingly complex algorithms and steadily growing image sizes while remaining general-purpose, the processors in the system need a flexible, high-bandwidth communication and handshaking mechanism.

Figure 2 shows the block diagram of the parallel system, which uses four TMS320C6416 chips and can quickly complete tasks that would take a long time on a single computer. As Figure 2 shows, the design draws on both tightly coupled and loosely coupled architectures. In a tightly coupled system, processors communicate through shared memory and are closely connected. In a loosely coupled system, each processor node has its own memory [5], and processors communicate by message passing. Each node in this system is a complete DSP processor with its own SDRAM, which makes the system loosely coupled. At the same time, all nodes share a single synchronous 4-port SRAM and form a single computing resource, which makes it tightly coupled. The system therefore combines the advantages of both approaches, offering better availability and performance than either one alone.

2.3 Synchronous 4-Port SRAM Channel Partitioning

The 128 KB synchronous 4-port SRAM is divided into 7 regions (see Figure 3). Apart from one common region, the remaining 6 regions are used for communication between DSPs. Because of the characteristics of the synchronous 4-port SRAM, these 6 regions form independent "channels" that do not interfere with one another and can be used simultaneously. The synchronous 4-port SRAM operates at a bus frequency of 133 MHz with a 16-bit data width, giving a bandwidth of 266 MB/s. Because the design is symmetric, the measured point-to-point communication overhead is the same whether the ping-pong method or the hot-potato method is used.
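
The channel count and bandwidth figures in Section 2.3 follow from simple arithmetic, sketched below in Python. The mapping of one channel to each pair of DSPs is an assumption (four DSPs form six unordered pairs, which matches the six communication regions); the actual layout is given only in Figure 3. The names `DSPS`, `channel_map`, and `peak_bandwidth_mb_per_s` are illustrative and do not come from the original design.

```python
from itertools import combinations

# The four processors; the labels are illustrative.
DSPS = ["DSP-1", "DSP-2", "DSP-3", "DSP-4"]

def channel_map(dsps):
    """Assign one channel index to every unordered pair of DSPs (assumed mapping)."""
    return {pair: idx for idx, pair in enumerate(combinations(dsps, 2))}

def peak_bandwidth_mb_per_s(bus_mhz, width_bits):
    """Peak bandwidth in MB/s, assuming one transfer of width_bits per bus clock."""
    return bus_mhz * width_bits // 8

channels = channel_map(DSPS)
regions = len(channels) + 1  # six pairwise channels plus one common region
bandwidth = peak_bandwidth_mb_per_s(133, 16)

print(len(channels), regions, bandwidth)  # prints: 6 7 266
```

Under these assumptions the sketch yields 6 channels, 7 regions, and 266 MB/s, in agreement with the figures stated above.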

2.4 System Working Principle and Performance Analysis

Digital video data is stored in a video buffer FIFO at a rate of up to 266 Mb/s. Under the control of the DMA controller of DSP-1, the data in the front-end FIFO is continuously transferred to the synchronous 4-port SRAM. Each DSP then reads the data it needs to process, either individually or at the same time as the others. Because the front-end FIFO and the synchronous 4-port SRAM are both connected to an independent interface of DSP-1, data distribution does not interfere with the execution of DSP-1's own algorithm or with its reads and writes to external SDRAM. The DSPs cooperate to complete the overall image processing algorithm. Any communication or data exchange between them during this process also goes through the synchronous 4-port SRAM. During initialization, each DSP loads its program into its own code and data space. After processing is completed, the results are sent out continuously over the PCI bus. The system also has sufficient expansion interfaces for further extension.

Each TMS320C6416 runs at 600 MHz, delivers 4800 MIPS, and has 32 MB of local SDRAM. The 4×DSP system therefore has a peak processing capability of 19200 MIPS and a total memory of 128 MB of SDRAM plus the 128 KB 4-port SRAM. Two further measures are speedup and efficiency [6-7]. Speedup is the factor by which a parallel algorithm runs faster than the serial algorithm for a given application; the efficiency of a parallel system is the ratio of speedup to the number of processors.
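
The two definitions above can be written out as a short Python sketch. The function names are illustrative, and the timings are the FFT figures reported in this article for one and four processors. The `amdahl_speedup` helper uses the standard textbook form of Amdahl's law with a hypothetical parallel fraction that is not taken from the original.

```python
def speedup(t_serial, t_parallel):
    """Speedup: how many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, processors):
    """Efficiency: speedup divided by the number of processors."""
    return speedup(t_serial, t_parallel) / processors

def amdahl_speedup(parallel_fraction, processors):
    """Textbook Amdahl bound for a given parallelizable fraction of the work."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)

# FFT timings in milliseconds reported in this article (1 DSP vs. 4 DSPs).
T1_MS, T4_MS = 82715.020, 20703.770

s = speedup(T1_MS, T4_MS)
e = efficiency(T1_MS, T4_MS, 4)
print(f"speedup = {s:.3f}, efficiency = {e:.2%}")  # speedup = 3.995, efficiency = 99.88%

# Hypothetical example: 99% of the work parallelizable, 4 processors.
print(f"Amdahl bound = {amdahl_speedup(0.99, 4):.3f}")
```

With the reported timings, the sketch reproduces the speedup of 3.995 and the efficiency of 99.88% quoted in the text.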

According to Amdahl's law [4], speedup increases with the number of processors, but only up to a limit. This limit is determined by the problem itself, because overhead also grows with the number of processors. For a 1024×2048-pixel image with 1 byte per pixel, the FFT takes 82,715.020 ms on a single processor and 20,703.770 ms on four processors, giving a speedup of 3.995 and a parallel efficiency of 99.88%. This is a significant performance improvement.

As digital signal processors develop rapidly and image processing algorithms grow more complex, architectures with multiple DSPs working in parallel will be adopted more and more widely. To meet the demands of increasingly complex image processing algorithms and ever larger images, a general-purpose, high-performance parallel image processing system was designed using four TMS320C6416 chips. It can quickly complete tasks that would take a single computer a long time. The system can serve as a general-purpose video detection hardware platform that implements various detection algorithms, and it is highly scalable, which facilitates secondary development. Experimental and application results show that the system calculates traffic information parameters in real time and transmits images and data over the network, demonstrating strong video processing and networking capabilities. In summary, the solution is flexible and simple, meets real-time requirements, and has proven in practice to be suitable for traffic flow detection systems, where it improves overall system performance.

