Maintaining a distributed control system (DCS) involves many complex tasks, so it is easy to feel lost when a fault occurs. Common faults can generally be approached from the following six perspectives.
I. Communication network failure
Communication network faults generally occur on the node bus or the local bus, or result from incorrect address identification.
Node bus failure
The transmission medium of a node bus is typically coaxial cable. Some use token signal transmission, while others use a multi-path contention bus signal transmission method with collision detection. Regardless of the method used, an interruption in any part of the bus trunk will cause communication failure for all stations and their sub-devices on that bus.
The common safeguard today is a dual-redundant configuration, so that a failure in one bus does not affect the entire system. However, redundancy does not prevent faults from occurring, and once one bus has failed, troubleshooting it can easily bring down the other bus, with very serious consequences. An effective approach should therefore start by preventing poor contact and open circuits on the bus.
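The limit of redundancy described here can be illustrated with a rough calculation. The sketch below is an assumed model only, not taken from any specific DCS: `p_single` is a hypothetical probability that one bus is down, and `p_induced` is a hypothetical probability that troubleshooting the failed bus takes down the healthy one.

```python
def comm_loss_probability(p_single: float, p_induced: float = 0.0) -> float:
    """Rough probability that both buses of a redundant pair are down.

    p_single  -- assumed probability that a given bus is failed
    p_induced -- assumed probability that work on the failed bus
                 accidentally disables the remaining healthy bus
    """
    for p in (p_single, p_induced):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    # The second bus is lost either on its own or through the repair work.
    p_second_down = p_single + (1.0 - p_single) * p_induced
    return p_single * p_second_down


if __name__ == "__main__":
    # Illustrative numbers only.
    print(comm_loss_probability(0.01))        # independent failures only
    print(comm_loss_probability(0.01, 0.10))  # with troubleshooting risk
```

Under these assumed numbers, even a modest chance of disturbing the healthy bus during repair dominates the total risk, which is why preventing contact problems matters more than redundancy alone.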
One particularly successful design is a node bus layout in which the coaxial cable connects behind the communication module rather than in front of it. This prevents accidental contact with the coaxial cable, and a possible break in the network, when a communication module fault is handled while the system is running. The coaxial cable is also never touched except during dedicated inspections, which avoids the loosening that repeated plugging and unplugging causes and the higher failure rate that follows. In addition, a coaxial cable inspection and replacement regime should be established so that a cable is replaced or repaired before its contact resistance rises far enough to affect communication.
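An inspection regime of this kind can be as simple as logging contact resistance at each inspection and flagging connectors before they become a problem. The following is a minimal sketch under stated assumptions: the connector names, the absolute limit, and the allowed rise are all hypothetical, and real limits would come from the cable and system vendor.

```python
def connectors_due(history: dict[str, list[float]],
                   limit_mohm: float,
                   max_rise_mohm: float) -> list[str]:
    """Return connectors whose contact resistance needs attention.

    history       -- readings per connector in milliohms, oldest first
    limit_mohm    -- assumed absolute limit for the latest reading
    max_rise_mohm -- assumed allowed rise since the first reading
    """
    due = []
    for name, readings in history.items():
        if not readings:
            continue  # never measured, nothing to judge yet
        latest = readings[-1]
        rise = latest - readings[0]
        if latest >= limit_mohm or rise >= max_rise_mohm:
            due.append(name)
    return sorted(due)


if __name__ == "__main__":
    # Hypothetical inspection log.
    log = {
        "node-bus-A/T1": [5.0, 5.2, 5.3],
        "node-bus-A/T2": [5.1, 7.9, 12.4],
        "node-bus-B/T1": [4.8, 6.0, 9.5],
    }
    print(connectors_due(log, limit_mohm=10.0, max_rise_mohm=4.0))
```

Checking the rise as well as the absolute value catches a connector that is degrading steadily before it reaches the limit.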
Local bus failure
Local bus or fieldbus is a data communication network typically composed of twisted-pair cables. Because the devices it connects are primary components or control equipment that directly interact with the production process, the working environment is harsh, the failure rate is high, and it is easily affected by accidental tampering by maintenance personnel, which can disrupt the production process. In addition, the bus itself can also experience communication failures due to various reasons.
Effective measures to prevent such failures include the following: handle the connection points between the local bus and local devices properly, so that installing or removing equipment does not affect normal bus operation; install bus branches in locations where they are not easily touched; and, ideally, use dual-path redundancy for the bus to improve communication reliability.
Incorrect address label
Whether it's a local component or a bus interface, an incorrect address will inevitably cause communication network disruption. Therefore, it's crucial to prevent incorrect address identification of components and to prevent accidental human intervention or modification. System expansion should generally be performed when the system is stopped. Especially for systems using token-based communication, any addition or removal of components must be announced to the network while the system is offline to avoid unforeseen consequences.
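Before components are added or removed, the planned address assignment can be checked offline for obvious mistakes. The sketch below is a generic illustration, not any vendor's tool: the device names and the valid address range are hypothetical.

```python
from collections import Counter


def check_addresses(devices: dict[str, int],
                    valid_range: range) -> dict[str, list[str]]:
    """Find devices with duplicated or out-of-range bus addresses.

    devices     -- mapping of device name to configured address
    valid_range -- addresses the bus accepts (assumed for illustration)
    """
    counts = Counter(devices.values())
    duplicates = sorted(n for n, a in devices.items() if counts[a] > 1)
    out_of_range = sorted(n for n, a in devices.items() if a not in valid_range)
    return {"duplicate": duplicates, "out_of_range": out_of_range}


if __name__ == "__main__":
    # Hypothetical address plan for one bus segment.
    plan = {"FT-101": 3, "PT-102": 4, "LT-103": 4, "TV-104": 130}
    print(check_addresses(plan, valid_range=range(1, 127)))
```

A check like this catches only configuration errors on paper; it does not replace verifying the address set on the physical device.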
II. Hardware failure
Based on the function of each hardware component, DCS hardware faults can be divided into human-machine interface (HMI) faults and process channel faults. The HMI covers the equipment used for human-machine interaction: engineering workstations, operator workstations, printers, keyboards, mice, and so on. Process channels mainly comprise local buses, channels, process processors, primary components, and control equipment. The HMI consists of several workstations with identical functions, so if one fails and is dealt with promptly, the system's monitoring and operation are generally unaffected. Process channel faults, by contrast, occur on the local bus or in primary equipment and directly affect control or detection functions, so their consequences are more serious.
Human-machine interface failure
Common HMI malfunctions include mouse operation failure, control operation failure, operator station crashes, membrane keyboard malfunctions, and printer malfunctions. Mouse failures are generally caused by internal mechanical parts that have aged or become contaminated through long use, making the switch contacts unreliable, or by a loose cable that cuts communication with the host computer. In these cases, inspecting the mouse and replacing it if necessary resolves the issue.
Control operation failure occurs when commands issued with the mouse do not change the state of the process channel. The cause may be a hardware fault in the process channel itself, or a software defect in the operator station that makes it unresponsive when the equipment is overloaded or too many process windows are open. After confirming that the process channel is working normally, check the operator station, and restart and initialize it if necessary.
An operator station may crash for many reasons, such as a hard drive or card failure, a software defect, overheating caused by a failed cooling fan, or overload. First, check the temperature rise of the computer itself. Then use the substitution method on the hard drive, computer cards, and other components to pinpoint the faulty part.
Membrane keypads are used in most operator stations. Their main function is to quickly retrieve process graphics, allowing operators to rapidly monitor process parameters. Malfunctions can occur due to incorrect keypad configuration, poor keypad contact, loose signal cables, or incomplete startup caused by accidental keypad movement during host operation. Different issues require different solutions.
A printer malfunction is usually due to configuration issues. Additionally, disabling the printer can also prevent printing. Furthermore, hardware failure can cause some or all of the printer's functions to malfunction. The printer settings and hardware should be checked and addressed accordingly.
Process channel failure
The most common problems in process channels are card failures and local bus failures. One cause is that the cards have been in service for a long time, so their components have aged or been damaged. Another is that an external signal line shorting to ground, or a strong signal entering the card, can also make a channel fail. Most cards today have good isolation, so the fault usually does not spread. However, once such a fault occurs, it directly disturbs process control or monitoring. It is therefore crucial to identify the cause promptly and replace the card in a timely manner.
Sometimes, a malfunction in a component or control device may not be immediately detected by the operator; it only draws attention when parameters become abnormal or an alarm sounds. A malfunction in the control processor (process processor) will generally generate an alarm immediately, alerting the operator. Modern control processors are almost always configured with 1:1 redundancy; a malfunction in one processor will not cause serious consequences, but the malfunctioning machine should be addressed immediately. During the troubleshooting process, it is absolutely crucial not to accidentally operate a functioning processor, as this could lead to severe consequences.
III. Human error
Human error can occur during system maintenance or troubleshooting, both among personnel who perform maintenance frequently and among those who are new to system repair and maintenance. Modifying control logic, installing software, restarting equipment, and forcing equipment or protection signals are particularly error-prone. The results range from minor abnormalities at some monitoring points or on some equipment to trips of a unit or of major auxiliary equipment. In chemical plants, malfunctions caused by human error account for a large proportion of unsafe incidents.
IV. Power supply failure
Power supply problems are also numerous: a backup power supply that fails to switch in automatically, unreasonable fuse configuration, and internal power supply failures that cause a loss of power. Voltage fluctuations can cause protection to misoperate, and poor connector contact can leave a regulated power supply with no output. In some systems, all the input signals of an entire cabinet are fed through a single fuse, or a single power supply carries a large external load. In other systems, the control power supply is neither interconnected nor redundant.
V. SOE malfunction
SOE (Sequence of Events) recording logs the events that occur when power equipment undergoes a remote signaling change, such as a switch changing state. Power protection equipment or smart meters automatically record the time of the change, its cause, and the corresponding telemetry values at the moment the switch trips (such as the three-phase current and active power), forming an SOE record for later analysis. Many relay protection devices and smart meters, such as those from GE Power, Schneider Electric, ABB, and Siemens, as well as dedicated power RTU equipment, have SOE recording capabilities.
SOE records play a crucial role in accident analysis and judgment. In practice, however, many power plants find that the SOE fails to record the time at which a protection acts, or that the recorded time does not match what actually happened. For example, on Unit #1 of one power plant, the recalled SOE event times did not correspond to the actual trip time, the display could not return after the SOE time printout was browsed, the time sequence did not reflect the cause of the first trip, and the SOE time sequence data could not be set. Some power plants have also found, in several accident analyses, that the event sequence in the SOE record disagrees with the sequence in the historical curves, sometimes to the point of being reversed: the same point shows different times in the historical curve and in the SOE, sometimes with large deviations. This delays accident analysis and can even point it in the wrong direction. SOE problems stem both from unreasonable system design, where SOE points are not fully concentrated on a single point, and from inadequate consideration of system hardware and software.
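A cross-check between the SOE record and the historical data can expose both kinds of discrepancy described here: large time deviations on the same point and reversed ordering between points. The sketch below assumes that both sources have already been exported as point-to-timestamp mappings in seconds; the point names and the tolerance are hypothetical.

```python
from itertools import combinations


def compare_soe_with_history(soe: dict[str, float],
                             hist: dict[str, float],
                             tolerance_s: float):
    """Compare event times from the SOE record and the historian.

    Returns (deviations, reversed_pairs):
      deviations     -- points whose times differ by more than tolerance_s,
                        mapped to (historian time - SOE time)
      reversed_pairs -- pairs of points whose order differs between sources
    """
    common = sorted(set(soe) & set(hist))
    deviations = {
        p: hist[p] - soe[p]
        for p in common
        if abs(hist[p] - soe[p]) > tolerance_s
    }
    reversed_pairs = [
        (a, b)
        for a, b in combinations(common, 2)
        # Opposite signs mean the two sources disagree on which came first.
        if (soe[a] - soe[b]) * (hist[a] - hist[b]) < 0
    ]
    return deviations, reversed_pairs


if __name__ == "__main__":
    # Hypothetical trip sequence, times in seconds.
    soe = {"MFT": 10.000, "TURBINE_TRIP": 10.050, "GEN_BREAKER": 10.120}
    hist = {"MFT": 10.0, "TURBINE_TRIP": 10.9, "GEN_BREAKER": 10.2}
    deviations, reversed_pairs = compare_soe_with_history(soe, hist, 0.5)
    print(sorted(deviations), reversed_pairs)
```

Because historians usually sample far more coarsely than SOE cards, the tolerance should reflect the historian's scan period; small deviations within that period are expected and do not by themselves indicate an SOE fault.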
VI. Faults caused by interference
There are also many examples of malfunctions caused by interference. The interference signal of the system may come from the system itself or from the external environment. Since different systems have strict requirements for grounding, if the grounding resistance or grounding method does not meet the requirements, it will reduce the efficiency of network communication or increase the possibility of bit errors. This can cause some functions to malfunction or even lead to network paralysis.
Power quality also affects the stable operation of the system. The system's power supply must not only hold its voltage stable but also switch over seamlessly to another supply if it fails; otherwise the switchover itself will disturb the system. Switching between the main and backup process control processors can also sometimes cause interference. In addition, high-power wireless communication equipment such as mobile phones and walkie-talkies is highly likely to cause interference when in use, endangering system operation.