Redundancy design using ARM processors in DCS controllers
2026-04-06 07:22:46 #1
In many areas of automation, rising demands for efficiency are driving higher requirements for the fault tolerance of automation systems, especially where equipment downtime is extremely costly. To meet these stringent requirements, redundancy techniques are commonly employed in DCS systems to satisfy the safety, reliability, and efficiency standards these fields demand. In DCS control systems, distributed processing units (DPUs) are a critical component. Currently, the controllers of these DPUs are often based on x86-series CPUs, an architecture that is widely used. However, x86-based DPUs have notable drawbacks, such as high heat generation, and these problems are difficult to solve in enclosed environments. With the development of low-power technologies and the emergence of low-power controllers, there are now more CPU options for DPUs; low-power controllers in particular generate very little heat and allow higher system integration. Despite the differences among domestic and international DCS products, redundancy techniques are widely used in all of them. DCS control systems primarily employ network redundancy, distributed processing unit redundancy, I/O card redundancy, and power supply redundancy.

1. DCS System Model

The DCS control system is a product of the combination of control technology, computer technology, communication technology, and graphics display technology. Its general architecture is shown in Figure 1, where DPU denotes the distributed processing unit. The architecture is divided into three layers: the process control layer, the monitoring layer, and the management layer. The process control layer is the foundation of DCS system control; its main components include control interfaces, field control units, detection instruments, and actuators.
The monitoring layer mainly consists of a monitoring computer, advanced operator stations, and interface devices; it is primarily responsible for operation monitoring, system alarms, trend display, and system diagnostics. The management layer mainly consists of a management computer responsible for managing the entire system. Most redundancy settings in a DCS control system concern the process control layer and the monitoring layer, so most redundancy technologies relate to the process control layer. The following discussion focuses on distributed processing unit redundancy and network redundancy in the DCS control system, using a distributed processing unit based on a recent ARM controller.

2. ARM-Based Distributed Processing Unit Structure

The block diagram of the ARM-based distributed processing unit is shown in Figure 2; it closely resembles the structure of a typical distributed processing unit. The unit is divided into six parts: the ARM controller, upper-layer network module, power management module, memory module, lower-layer network module, and clock module. The ARM controller and memory module form the basic embedded system, where data processing and the control strategies for the entire DCS control system are executed. The upper-layer network module comprises the primary/redundant network communicating with the management layer and a network for data exchange between the primary and redundant distributed processing units; its main functions are to let the management layer monitor the DCS control system in real time and to provide data redundancy between the primary and redundant distributed processing units. The lower-layer network module consists of two RS-485 networks, one primary and one secondary, and handles data communication between the ARM controller and the I/O cards.
The memory module can be divided into two parts: one manages and stores the operating system, the other manages and stores real-time data. The power module and clock module provide power management and clock signals, respectively.

3. Controller Redundancy

Redundancy of the distributed processing unit falls into two cases by redundancy degree, 1:1 and 1:n; the choice varies among DCS manufacturers' products, and each method has its advantages and disadvantages. This section introduces the 1:1 configuration, in which two identically configured controllers operate in redundant mode. During operation, if a fault occurs, the two distributed processing units switch over seamlessly, keeping the system running normally. Each distributed processing unit in this system has three network interfaces: two connect to network A and network B respectively, enabling data communication with the workstations, while the third links the paired distributed processing units via a point-to-point network (optical fiber can be used as the transmission medium) that carries backup data between the master and slave units. Status information between the distributed processing units is exchanged via a serial port. The connection between the redundant distributed processing units and the network is shown in Figure 3. The working principle is as follows: the two distributed processing units have identical hardware configurations and contain the same operating system, configuration software, and configuration information. At any given time, only one of them receives process information through a dual-machine switching card, performs the calculations, and generates the control output that drives the process equipment; this unit is called the master distributed processing unit.
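The fault-driven switchover between the master and the standby unit can be sketched in a few lines of C. This is a minimal illustration only, assuming the standby mirrors the master's data each cycle and applies a completeness check plus a numeric tolerance band; all names (`sync_frame_t`, `standby_check`, `SYNC_TOLERANCE`) and the tolerance value are invented for exposition, not taken from any vendor's API.

```c
/* Sketch: per-cycle decision on the standby DPU. All identifiers and the
   tolerance value are illustrative assumptions, not a vendor interface. */
#include <stdbool.h>

typedef enum { ROLE_STANDBY, ROLE_MASTER } dpu_role_t;

typedef struct {
    bool   frame_complete; /* did the full backup frame arrive this cycle? */
    double master_value;   /* result computed by the master unit */
    double mirror_value;   /* value mirrored/recomputed on the standby */
} sync_frame_t;

/* Assumed fault-tolerance band for the mirrored data. */
#define SYNC_TOLERANCE 1e-3

/* Role the standby should adopt after checking one backup frame:
   an incomplete frame or out-of-tolerance data implies a master fault,
   so the standby promotes itself; otherwise it keeps mirroring. */
dpu_role_t standby_check(const sync_frame_t *f)
{
    if (!f->frame_complete)
        return ROLE_MASTER;           /* master presumed failed */

    double diff = f->master_value - f->mirror_value;
    if (diff < 0.0)
        diff = -diff;
    if (diff > SYNC_TOLERANCE)
        return ROLE_MASTER;           /* master data out of tolerance */

    return ROLE_STANDBY;              /* master healthy: stay in backup */
}
```

Running this check once per work cycle keeps the promotion decision within a single calculation cycle, which is what allows the switchover described here to appear seamless at the process level.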
Meanwhile, the redundant distributed processing unit is not idle. In each work cycle it copies, in real time over the optical fiber linking the two units, the process information received and the results calculated by the master, so that it stays synchronized with the master at all times. The redundant unit also checks whether the copied information is complete and within the allowable fault-tolerance range; if the information is incomplete or erroneous, the master is judged to be malfunctioning. In that case the redundant unit locks out the master over the optical fiber and switches itself over to become the master. Switchover completes within the shortest calculation cycle (tens of milliseconds or less), giving a smooth transfer. Once the original master recovers, the redundant unit automatically returns control to it and reverts to redundant backup mode. As Figure 3 shows, the arbitration circuit of the dual distributed processing units works by feeding each unit's diagnostic results into the arbitration circuit; each unit then reads back the arbitration result to determine its primary/secondary status. The priority of each state is set by programming. Because both distributed processing units determine their primary and secondary states from the arbitration result, they switch automatically whenever that result changes.

4. Network Redundancy

A DCS control system contains at least two networks: the communication network between the management layer and the control layer, and the RS-485 communication network between the process control layer and the lower I/O card layer.
The upper-layer network primarily lets the management layer monitor the control layer in real time and configure the lower-layer controllers. The lower-layer network carries the data collected by the I/O cards up to the process control layer, and carries control data from the control layer back down to the I/O cards. The importance of these two networks to the DCS control system is self-evident. To provide redundancy for the upper-layer network, various industrial Ethernet switches supporting link redundancy have appeared in industrial automation, addressing the network paralysis that node failures can cause. To further improve the reliability and fault tolerance of data communication, this DCS control system also adopts a ring topology; however, because a failure at any single node of a simple ring can bring down the whole network, link-redundancy technology for ring networks has emerged. To ensure smooth operation, both the upper- and lower-layer networks are configured with 1:1 redundancy. When a distributed processing unit detects a failure of the main network, it abandons communication through the main network port, enables the redundant network's port, and moves data transmission and reception onto the redundant network; the redundant network then carries the data communication while the main network is repaired. Dual-ring technology, built on TurboRing, adds redundancy of the transmission medium and further improves system reliability; however, because managing and switching dual rings is relatively complex, this system adopts a simpler method.
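The main-to-redundant network switchover just described is a simple preference rule, sketched below under stated assumptions: the link states would come from the node's own diagnostics, and the names (`net_port_t`, `select_port`) are illustrative, not part of any real driver API.

```c
/* Sketch: choosing the active network port for the next send.
   main_ok / redundant_ok would come from link diagnostics; all names
   here are assumptions made for illustration. */
#include <stdbool.h>

typedef enum { PORT_MAIN = 0, PORT_REDUNDANT = 1 } net_port_t;

net_port_t select_port(bool main_ok, bool redundant_ok)
{
    if (main_ok)
        return PORT_MAIN;        /* prefer the main network while it is up */
    if (redundant_ok)
        return PORT_REDUNDANT;   /* fail over to the redundant network */
    return PORT_MAIN;            /* both down: keep retrying the main port */
}
```

Because the rule always prefers the main port when it is healthy, traffic automatically returns to the main network once it has been repaired, with no separate fail-back step.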
Each node on the network may be in one of the following five situations:
(1) online networked operation: at least two nodes in the system are running;
(2) online stand-alone operation: only the local node is running;
(3) offline stand-alone operation: the node's network card exists and is normal, but it is not connected to the network;
(4) offline stand-alone operation: the node's network card exists but is faulty;
(5) offline stand-alone operation: the node's network card does not exist.
Current network controllers provide command, diagnostic, configuration, and status registers; by reading and writing these registers, the five situations above can be distinguished. A node's network status can change for many reasons during operation. To reflect the operational status of every network node online, each node maintains a network status table recording the state of each network interface card (NIC) in the system. When the status of one or both NICs on a node changes, the other nodes should learn of it as quickly as possible; therefore each node's two NICs periodically broadcast a test packet announcing the NIC's presence, and when other nodes receive this packet they update the corresponding NIC's entry in their network status tables. However, if a NIC goes offline during operation, it can neither send nor receive packets; because it is already registered in the other nodes' status tables, they would continue to regard it as present and functioning normally, which clearly would not reflect its true state. To report NIC status accurately, each node's two NICs therefore broadcast their periodic test packets and, at the same time, increment the status count of every NIC in the local network status table by 1, saturating at a maximum value LIMIT.
Whenever a test packet is received from a node, the status count of that node's corresponding network interface card (NIC) is reset to 0. A NIC whose status count remains below the maximum value LIMIT is thus known to be online, so the periodic broadcasts give real-time monitoring of network operation and accurately reflect each node's status. An inter-network transmission device with routing capability can also be added between the two networks; if both networks fail at the same time, it can automatically find a feasible path to form a loop and keep system communication running. Test packets are broadcast from every link whose local NIC is online; for ordinary data transmission, the network status table can then be consulted to choose a more lightly loaded link (one with fewer active nodes). The workflow of the dual-network system is shown in Figure 4.
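The aging-counter mechanism behind the network status table can be sketched as follows. This is a minimal illustration under assumptions: the table dimensions, the value of LIMIT, and all identifiers (`nic_table_t`, `table_age`, `table_heard`, `nic_online`) are invented for exposition; a real implementation would size the table from the configuration and tune LIMIT to the broadcast period.

```c
/* Sketch of the network status table: each entry counts broadcast
   periods since a test packet was last heard from that NIC. Counts
   saturate at LIMIT; a NIC is treated as online while its count is
   below LIMIT. All names and constants are illustrative assumptions. */
#include <stdbool.h>

#define NUM_NODES     8   /* assumed number of nodes in the system */
#define NICS_PER_NODE 2   /* each node has two NICs, one per network */
#define LIMIT         3   /* assumed aging threshold, in broadcast periods */

typedef struct {
    int count[NUM_NODES][NICS_PER_NODE];
} nic_table_t;

/* Called once per broadcast period, when this node sends its own test
   packets: age every entry in the table, saturating at LIMIT. */
void table_age(nic_table_t *t)
{
    for (int n = 0; n < NUM_NODES; n++)
        for (int c = 0; c < NICS_PER_NODE; c++)
            if (t->count[n][c] < LIMIT)
                t->count[n][c]++;
}

/* Called when a test packet arrives from (node, nic): reset its count. */
void table_heard(nic_table_t *t, int node, int nic)
{
    t->count[node][nic] = 0;
}

/* A NIC counts as online while its count has not reached LIMIT, i.e.
   while test packets keep arriving faster than the entry can age out. */
bool nic_online(const nic_table_t *t, int node, int nic)
{
    return t->count[node][nic] < LIMIT;
}
```

A silent NIC thus ages out of the table after LIMIT broadcast periods without anyone needing to detect its failure directly, which is exactly how the scheme avoids trusting stale registrations.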