PCI bus-based embedded real-time image processing system

1. Current Status of Image Processing System Development

Image information processing and its results play a crucial role in computer information processing and applications. The development of image processing depends on advances in processor chip technology (including microcontrollers and DSPs) and on the emergence of large-capacity, low-cost memory. Although image processing systems have evolved from large, chassis-based structures to miniaturized plug-in cards, the volume of data involved means that they generally still cannot meet the real-time and capacity requirements of most real-time processing scenarios. This shows in the following two aspects.

1.1 Real-time Requirements

Image processing systems can be implemented in many ways: in software on a general-purpose computer, on a microcontroller, or on a special-purpose DSP chip. Each of these has drawbacks. A pure software implementation is too slow for real-time systems; microcontrollers use a von Neumann bus architecture and multiply too slowly; special-purpose DSP chips lack flexibility, and their development tools are not very mature.

In this system, image processing is implemented by adding a DSP acceleration card to a general-purpose computer. Most image processing tasks are completed by the high-speed DSP chip, with the computer serving only for auxiliary operations and storage. This approach exploits the speed of the DSP while remaining flexible, and the development tools are relatively comprehensive. Real-time performance also requires sufficient transfer speed: the PCI bus (32-bit, 33 MHz) reaches a peak of 132 MB/s, far above older buses such as ISA, which reaches only about 5 MB/s.
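
The 132 MB/s figure follows from the bus geometry: one 32-bit transfer per 33 MHz clock cycle. The short Python sketch below reproduces that arithmetic. It assumes peak burst transfer with no protocol overhead; the function name `pci_peak_mb_per_s` is illustrative, and the ISA figure is the one quoted above.

```python
def pci_peak_mb_per_s(bus_width_bits: int, clock_mhz: float) -> float:
    """Peak rate in MB/s, assuming one full-width transfer per clock cycle."""
    return bus_width_bits / 8 * clock_mhz

ISA_MB_PER_S = 5.0  # ISA figure quoted in the text

pci_32bit_33mhz = pci_peak_mb_per_s(32, 33)
pci_64bit_66mhz = pci_peak_mb_per_s(64, 66)

print(f"32-bit/33 MHz PCI: {pci_32bit_33mhz:.0f} MB/s")
print(f"64-bit/66 MHz PCI: {pci_64bit_66mhz:.0f} MB/s")
print(f"Speed-up over ISA: {pci_32bit_33mhz / ISA_MB_PER_S:.1f}x")
```

These are upper bounds. Sustained throughput on a real bus is lower because of arbitration, address phases, and wait states, none of which are modeled here.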

1.2 Miniaturization Requirements

Modern image processing applications are trending towards miniaturized systems that integrate the entire processing chain into a small "black box", or even onto a single circuit board. This calls for high-speed, highly integrated processing chips that can perform tasks which previously required numerous components. Because a DSP is built around a central processing unit, many functions can be integrated into one device, and more complex functions can be added through external expansion, which makes system miniaturization achievable.

A further characteristic of video data, and one of its biggest differences from audio data, is its large volume. Real-time processing of large amounts of data requires not only a high-speed CPU but also a large, expandable storage space. A software implementation on a general-purpose computer is bounded by that computer's storage, which limits expandability; the expandability of a microcontroller is limited as well. Neither meets the capacity requirements of video data.

This paper addresses the real-time and miniaturization requirements by designing an image acquisition system built on a DSP accelerator card. The DSP chip completes most of the image processing work, while the host computer handles only auxiliary operations and storage. This approach exploits the speed of the DSP while remaining flexible, and the development tools are relatively complete. In its 64-bit, 66 MHz form the PCI bus reaches a peak of 528 MB/s, a speed the older buses cannot match. The C6000 series DSP in the system has an expandable storage space of up to 1 GB, which is sufficient for a typical image processing system.
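
To put a number on the "large volume" of video data, the raw data rate of the stream can be estimated. This is a rough Python sketch, assuming the 720×576, 16-bit-per-pixel, 25 frames/s format that Section 2 of this article describes. The helper names and the memory sizes passed to `frames_that_fit` are illustrative only.

```python
WIDTH, HEIGHT = 720, 576   # active picture size given in Section 2.2
BYTES_PER_PIXEL = 2        # 16-bit 4:2:2 YUV
FRAMES_PER_SECOND = 25     # PAL frame rate given in Section 2.1

def frame_bytes() -> int:
    """Size of one uncompressed frame in bytes."""
    return WIDTH * HEIGHT * BYTES_PER_PIXEL

def stream_bytes_per_second() -> int:
    """Raw data rate of the uncompressed video stream."""
    return frame_bytes() * FRAMES_PER_SECOND

def frames_that_fit(memory_bytes: int) -> int:
    """Number of whole uncompressed frames a memory of this size can hold."""
    return memory_bytes // frame_bytes()

print(frame_bytes())                      # bytes per frame
print(stream_bytes_per_second() / 1e6)    # MB per second
print(frames_that_fit(32 * 1024 * 1024))  # frames in an assumed 32 MiB buffer
```

Under these assumptions one frame is 829,440 bytes and the stream is about 20.7 MB/s, so a 32 MiB buffer holds 40 uncompressed frames, or roughly 1.6 seconds of video.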

2. DSP Image Processing System Structure

This system uses the TMS320C6211 from the TI C6000 series of DSPs as its CPU. Analog image signals enter from a camera and are digitized by the A/D converter in a video decoder chip. The digital data passes through a FIFO into the DSP, which performs image enhancement, segmentation, feature extraction, and data compression. The output is then converted by a PCI interface chip into signals conforming to the PCI bus specification and transmitted to the host computer over the PCI bus interface. The system is designed for intelligent building management, primarily for real-time detection of important building parameters. It can also be used in other monitoring systems; the hardware is essentially the same and only the software functions differ. The system structure is shown in Figure 1.

[img=450,164]http://www.e-works.net.cn/ewk2004/fileupload/images/127465331654062500.gif[/img]

As Figure 1 shows, the system can be divided into three modules: a DSP image processing module, a video decoding module, and a PCI bus module. The implementation and functions of these three modules are described below.

2.1 DSP Image Processing Module

The CPU of this module is the TMS320C6211 from TI's C6000 series. The C6000 is a high-speed digital processing family released by TI for video processing, suited to applications that demand high speed and high intelligence, such as mobile communication base stations, image surveillance, and radar systems. For storage, two HY57V651620B chips from Hyundai serve as SDRAM during operation, and an AM29LV800B chip from AMD serves as FLASH memory from which programs and parameters are loaded at startup. The structure is shown in Figure 2. In the figure, HPI (Host Port Interface) is the host port and EMIF (External Memory Interface) is the external memory interface, which supports both synchronous and asynchronous memories.
[img=450,226]http://www.e-works.net.cn/ewk2004/fileupload/images/127465331850781250.gif[/img]

(1) TMS320C6211 Processor Characteristics and Functions

The TMS320C6211 processor consists of three main parts: CPU core, peripherals, and memory. Its high-speed performance is reflected in the following aspects:

① The memory space of the TMS320C6211 can be expanded to a maximum of 512 MB, which fully meets the memory needs of image processing systems. Its clock runs at up to 167 MHz; at eight instructions per cycle this gives a peak of about 1333 MIPS (million instructions/second), or 1200 MIPS at the 150 MHz clock used in this system. The quoted peak operation rate is 2400 MOPS (million operations/second).

② Parallel processing structure. The TMS320C6211 has eight parallel functional units, divided into two identical groups. This parallel structure breaks with traditional designs and gives the chip its high performance.

③ The chip uses the VelociTI architecture, a high-performance Very Long Instruction Word (VLIW) design. A single instruction is 32 bits long, and eight instructions form an instruction packet with a total length of 256 bits, so the chip can execute up to eight instructions per clock cycle. A dedicated instruction dispatch module distributes each 256-bit packet to the eight functional units for concurrent execution.

④ Pipelining is used to achieve high speed and efficiency. The TMS320C6211 reaches its peak MIPS only when the pipeline is fully utilized. The C6211 pipeline consists of three phases (instruction fetch, decode, and execute) that together comprise 11 stages.

The DSP's main function is to process the data read from the FIFO, including program-driven recognition, feature extraction, and parameter detection. When the camera captures images at 25 frames per second, the DSP has at most 40 ms to process each frame.
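
The timing argument in this passage reduces to simple arithmetic, sketched below in Python. The 150 MHz clock, the 30 ms working budget, and the 1440-byte by 576-row frame are the values the article uses in the sentences that follow. The eight-instructions-per-cycle rate is the peak, so the result is an optimistic upper bound rather than a measured figure.

```python
FRAME_RATE_HZ = 25        # camera frame rate from the text
CPU_CLOCK_MHZ = 150       # DSP clock used in this system
INSTR_PER_CYCLE = 8       # peak: one eight-instruction packet per cycle
WORK_BUDGET_MS = 30       # per-frame budget after delays and storage time
FRAME_BYTES = 1440 * 576  # bytes per row times rows per frame

# Time between successive frames, in milliseconds.
frame_period_ms = 1000 / FRAME_RATE_HZ

# Peak instruction rate in millions of instructions per second.
peak_mips = CPU_CLOCK_MHZ * INSTR_PER_CYCLE

# Instructions available in the budget: (MIPS * 1e6) * (ms / 1000).
instruction_budget = peak_mips * 1000 * WORK_BUDGET_MS

# Average number of instructions that can be spent on each byte of the frame.
instructions_per_byte = instruction_budget / FRAME_BYTES

print(frame_period_ms, peak_mips, instruction_budget)
print(round(instructions_per_byte, 1))
```

At the peak rate this leaves roughly 43 instructions per byte of image data, which is the margin behind the claim that the frame can be processed in time. Pipeline stalls and memory access delays would reduce that margin in practice.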
Allowing for system delay and the time needed to store the processed image, the DSP cannot spend more than 30 ms on one image. At the C6211's peak rate of 1200 MIPS (150 MHz), it can execute 36M instructions within 30 ms. The DSP reads row data from the FIFO and stores it in the SDRAM. A frame has 576 rows, and the frame interrupt is received at the last row. At that point the image data in the SDRAM totals 1440 bytes × 576 rows = 829,440 bytes, about 810 KB. A budget of 36M instructions is more than enough for the C6211 to process 810 KB of data.

(2) SDRAM Memory

The HY57V651620B is a 128 Mb SDRAM released by Hyundai. It supports page-mode addressing, has an access time of 7.5 ns, and works synchronously with the DSP system. Since the DSP runs at 150 MHz, the SDRAM runs at half or two thirds of the DSP clock, that is, 75 MHz or 100 MHz. At 100 MHz the SDRAM requires some special timing control and is not simply synchronized with the DSP. The SDRAM mainly stores temporary data and processing results during system operation. The system has 256 Mb of SDRAM in total, and how much of it is consumed depends on the size of the startup program and the image processing program.

(3) FLASH Memory

The AM29LV800B is an 8 Mb FLASH chip released by AMD. It also supports page-mode addressing and operates in asynchronous mode. The startup program is stored in the FLASH chip. When the system is powered on, the program in FLASH is loaded into the DSP's SDRAM for execution. The FLASH can be rewritten in-system, which greatly simplifies modifying and upgrading the startup program.

2.2 Video Decoding Module

The video A/D stage of this system uses the Philips SAA7111A as the video decoder chip. The SAA7111A provides four analog video inputs and two analog processing channels, and supports four CVBS signals, two Y/C signals, or a combination of 2×CVBS and 1×Y/C.
The SAA7111A performs A/D conversion on the standard PAL analog image signal from the camera and outputs 16-bit YUV data in 4:2:2 format, conforming to the CCIR601 recommendation, to the FIFO. The luminance signal (Y) occupies 8 bits, and the chrominance signals (Cr and Cb) share the other 8 bits. The FIFO is an IDT72V215LB from IDT, organized as 512×18. According to the CCIR601 standard, the YUV image resolution is 720×576. Each line output by the SAA7111A is therefore 720 × 16 bits, which is 1440 bytes. Because the DSP communicates with the FIFOs through a 32-bit SBSRAM interface, YUV data is written to two FIFOs with ping-pong switching between them. A line of 720×16-bit data thus becomes 360×32-bit words when stored across the two FIFOs.

2.3 PCI Communication Module

PCI (Peripheral Component Interconnect) can be used as both an intermediate (mezzanine) bus and a peripheral bus. Compared with other common bus specifications, PCI provides better support for high-speed I/O devices such as graphics adapters, network interface controllers, and disk controllers. The current standard allows 64 data lines at 33 MHz, for a transfer rate of about 2.1 Gbps (264 MB/s). The PCI bus also supports linear burst transfers, which keep the bus loaded with data and make effective use of its bandwidth. In addition, it offers low-latency random access, with a write latency of 60 ns from a master register to a slave register.

The appeal of the PCI specification lies not only in its speed but also in how well it fits the system requirements of modern I/O devices. It allows fully automatic configuration of PCI expansion cards and devices, and needs only minimal interface logic to implement and to bridge to other bus systems.

The HPI port of the TMS320C6211 cannot be connected directly to the PCI bus. This system uses TI's PCI2040 to connect the DSP's HPI port to the PCI bus.
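
Whether the bus can keep up with the camera can be checked with a rough calculation. The Python sketch below estimates how long one uncompressed frame would occupy the bus. It assumes the 1440-byte by 576-row frame from Section 2.1, the 132 MB/s peak rate of 32-bit, 33 MHz PCI from Section 1.1, and the 40 ms frame period. The function name is illustrative, and any throughput limit imposed by the HPI port or the bridge chip is not modeled.

```python
FRAME_BYTES = 1440 * 576   # one uncompressed frame, from Section 2.1
PCI_PEAK_MB_PER_S = 132    # 32-bit, 33 MHz PCI peak rate, from Section 1.1
FRAME_PERIOD_MS = 40       # 25 frames per second

def transfer_time_ms(payload_bytes: int, bus_mb_per_s: float) -> float:
    """Time to move payload_bytes at the given peak rate, in milliseconds."""
    return payload_bytes / (bus_mb_per_s * 1_000_000) * 1000

frame_transfer_ms = transfer_time_ms(FRAME_BYTES, PCI_PEAK_MB_PER_S)
bus_share = frame_transfer_ms / FRAME_PERIOD_MS

print(f"{frame_transfer_ms:.2f} ms per frame, {bus_share:.0%} of the frame period")
```

Even for an uncompressed frame the transfer takes about 6.3 ms at the peak rate, a small share of the 40 ms period. Since the DSP compresses the data before output, the actual load should be lower still.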
Data processed by the DSP is passed through the HPI port to the PCI2040, which translates it and drives it onto the PCI bus. The logic structure is shown in Figure 3.

[img=450,350]http://www.e-works.net.cn/ewk2004/fileupload/images/127465332136093750.gif[/img]

The PCI2040 is a dedicated chip designed by TI for interfacing C5000 and C6000 series DSPs with the PCI bus. It conforms to the PCI Local Bus 2.2 specification and connects the PCI bus directly to the HPI port of a TMS320C54X or TMS320C6X DSP. The PCI2040 works with both 3.3 V and 5 V signaling to accommodate different PCI bus voltages. No level conversion or additional control logic is needed between the PCI2040 and the C6211, so the interface circuitry is very simple. In this system the PCI2040 uses two supply voltages: 3.3 V for the HPI port and 5 V for the PCI bus.

The PCI2040 requires its PCI bus registers and HPI registers to be preloaded at startup. The system's PCI interface module therefore includes a configuration ROM, the AT24C08A, which is an EEPROM and so makes it easy to modify and upgrade the configuration parameters. At system startup, the data stored in the AT24C08A is loaded into the PCI2040's registers to configure it. In the diagram, /HINT[3:0], /HCS[3:0], HRDY[3:0], and HRST[3:0] connect to the corresponding signals of four DSPs, meaning that the PCI2040 can interface with four DSPs simultaneously.

3. PCI Bus Driver Implementation Method

On the software side, image processing systems based on the PCI bus face many challenges, the hardest of which is the PCI driver. To explain how a PCI bus driver is implemented under the Windows operating system, it is first necessary to understand the PCI configuration space. PCI devices have three physical address spaces: configuration space, memory space, and I/O space. The configuration space is a contiguous space of 256 bytes, as defined in Figure 4.
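
Software reads the configuration space as plain 32-bit words and picks fields out by bit position. The Python sketch below decodes two of those words under the standard PCI header layout: the ID register at offset 00h and a base address register, both of which the following paragraphs describe. The function names are illustrative. The ID value is the one this article reports for the PCI2040 in Section 3.3, while the two base address values are made up for the example.

```python
def split_id_dword(dword: int) -> tuple[int, int]:
    """Split the dword at offset 00h into (vendor_id, device_id).

    The vendor ID is in the low 16 bits and the device ID in the high 16 bits.
    """
    vendor_id = dword & 0xFFFF
    device_id = (dword >> 16) & 0xFFFF
    return vendor_id, device_id

def decode_bar(bar: int) -> tuple[str, int]:
    """Classify a base address register and strip its flag bits.

    Bit 0 set means I/O space (the low 2 bits are flags); bit 0 clear
    means memory space (the low 4 bits are flags).
    """
    if bar & 0x1:
        return "io", bar & 0xFFFFFFFC
    return "memory", bar & 0xFFFFFFF0

vendor, device = split_id_dword(0xAC60104C)    # value quoted in Section 3.3
print(f"vendor={vendor:04X} device={device:04X}")
print(decode_bar(0xFEB00000))   # hypothetical memory-mapped BAR
print(decode_bar(0x0000E001))   # hypothetical I/O-mapped BAR
```

A real driver would obtain these words from the operating system's configuration-space access services rather than from constants.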
Within the configuration space, the read-only region contains the device identifier, vendor code, version number, classification code, and header type. The vendor code identifies the device vendor; the device identifier identifies a specific device; the version number identifies the device's revision; the classification code identifies the device type; and the header type identifies the header layout and whether the device is multi-function. Apart from the vendor code, the values of these fields are assigned by the vendor.

The most important function of the base address registers is to allocate system address space for the PCI device. In a base address register, bit 0 (the least significant bit) indicates whether the register maps into memory space or I/O space: bit 0 is "0" for memory space and "1" for I/O space.

3.1 Drivers in the Windows Environment

A driver can be understood as a set of functions that control a hardware device. In Windows, drivers are packaged as a DLL or a VxD. When the hardware is a non-standard device, a Windows device driver must be designed specifically for it.

Under DOS, an application always assumes it is the only program running, so it can access the hardware directly and take exclusive use of all system memory and run time; it therefore does not need a device driver. Under Windows, several applications may be running at once, so the system cannot let them access hardware directly at will; doing so would cause access conflicts between applications and lead to system crashes.
To solve this problem, system resources are virtualized: applications run on a virtual machine (VM) in a virtualized environment, while the virtual machine manager (VMM) and the drivers run on the real machine at Ring 0 and handle hardware operations. Virtual resources are simulations of hardware (or even software) resources. When a system virtualizes all, or almost all, of the resources a program can access, it creates a "virtual machine". The Windows virtual machine transparently simulates the following resources and behavior:

(1) accessible memory space
(2) I/O operations
(3) interrupt operations
(4) peripheral devices (monitor, keyboard, etc.)

Windows applications therefore run in protected mode at Ring 3 (the least privileged level) and cannot access hardware directly. An attempt to operate on hardware raises an exception, at which point the processor switches to Ring 0 and hands control to the corresponding handler. All Windows applications share a single system virtual machine.

Windows mainly uses the page fault mechanism to virtualize access to memory-mapped devices. To trap accesses to a device's memory, the device's VxD marks the corresponding memory page as "not present" in the page table. When a program running in the virtual environment tries to access this page, a fault occurs, and the VM's exception handler calls the page fault handler registered by the device VxD to deal with it. In other words, when the VM's access to the device raises an exception, control passes to the callback handler registered for that hardware, which is the function we wrote.

3.2 Common Drivers

There are three types of drivers: VxD, KMD, and WDM.

(1) VxD, Virtual Device Driver. A VxD is an extension of the system for identifying, managing, and maintaining hardware resources. The VxDs and the VMM together keep the system running.
A VxD runs at Ring 0 of the protected mode of Intel CPUs and has the highest level of control over the hardware.

(2) KMD, Kernel Mode Driver. This is the driver model introduced by Windows NT for managing and maintaining hardware operation. The driver runs in Windows NT kernel mode (similar to Ring 0). However, the execution context of a KMD varies over time: the context in which the driver receives a device request may be fundamentally different from the context in which that request is actually carried out. A driver under Windows NT is therefore subject to many restrictions imposed by Windows NT itself.

(3) WDM, Windows Driver Model (originally called the Win32 Driver Model). It is supported by Windows 2000 and is a new driver model strongly promoted by Microsoft. It can essentially be understood as a plug-and-play KMD. WDM code is lengthy, and Windows 98 has only limited support for WDM, so WDM is unlikely to replace VxD in the short term.

3.3 PCI2040 Driver Development

The PCI2040 is not a standard hardware device, so its hardware installation information file and its driver must be written by hand. The PCI2040 configuration space is shown in the attached table:

[img=321,273]http://www.e-works.net.cn/ewk2004/fileupload/images/127465332311562500.gif[/img]

The Device ID identifies a specific device, and its value is assigned by the vendor; the Vendor ID identifies the manufacturer of the device. Together these two uniquely identify a PCI device. For the PCI2040, reading the combined Vendor/Device ID register returns AC60104C (hexadecimal): Device ID AC60 in the upper 16 bits and Vendor ID 104C in the lower 16 bits.

(1) Device Information Installation File

The device information installation file (.INF) contains the driver's name, the directory the driver should be copied to, and the registry entries that must be created or modified during driver installation. The INF file can be written with the INF Editor tool in the VtoolsD development package.
Below is the PCI2040 installation information file that I wrote.

[Version]
Signature=$CHICAGO$
Class=PCI Bridge                      ; Device type is PCI bridge device
Provider=%String0%                    ; Vendor name

[ClassInstall]

[DestinationDirs]
DefaultDestDir=11                     ; Install the driver in the C:\Windows\System directory

[Manufacturer]
%String1%=SECTION_0

[SECTION_0]
%String2%=sevenstar,PCI\VEN_104C&DEV_AC60

[sevenstar]
CopyFiles=CopyFiles_sevenstar
AddReg=AddReg_sevenstar

[CopyFiles_sevenstar]
DSP_PCI_Bridge.vxd                    ; Driver name is DSP_PCI_Bridge.vxd

[AddReg_sevenstar]
HKR,,DevLoader,0,DSP_PCI_Bridge.vxd   ; Add the relevant information to the registry

[sevenstar_LogConfig]
ConfigPriority=NORMAL
IRQConfig=3,7,9,10,15                 ; Interrupt lines the PCI2040 may select

[ControlFlags]

[SourceDisksNames]
1=pci2040 driver disk,,0000-0000

[SourceDisksFiles]
DSP_PCI_Bridge.vxd=1

[Strings]
String0="Texas Instruments"
String1="TI"
String2="PCI Bridge"

(2) VxD Creation

Windows supports both static and dynamic loading of VxDs. A statically loaded VxD is installed during Windows initialization and stays resident, while a dynamically loaded VxD is installed and removed under the control of an application or another VxD. A statically loaded VxD always occupies some memory and interrupt resources, which are wasted if it is not running all the time. When developing a VxD with QuickVxD, simply check the "Dynamically Loadable" option.

"Device Name" is the device name of the VxD; each VxD has a unique device name, which here refers to the PCI2040 chip. "Device ID" identifies the device and is used only when the VxD must provide an entry point for calls from other VxDs. It cannot be chosen arbitrarily and should be assigned by Microsoft; in general, "UNDEFINED_DEVICE_ID" is sufficient. "Device Initialization Order" determines the order in which Windows initializes VxDs.
For example, to have a VxD initialized before the VDD, set it to "VDD_INIT_ORDER-1"; usually, though, the default value is used.

4. Conclusion

The key to implementing an image processing system lies in handling the temporary storage, compression, and transmission of large amounts of information. This system addresses all three. For temporary storage, the expandability of the DSP's memory space ensures that the system can buffer enough information. Compression is what a DSP does best, so a large amount of compression work can be completed in a very short time. The PCI bus provides enough bandwidth for the information to be transmitted quickly.
