Industrial control computer software anti-interference technology
2026-04-06 05:04:57
Introduction

In industrial settings, power equipment of all kinds starts and stops constantly, creating a harsh environment with severe electromagnetic interference. Industrial control computers face enormous challenges in such environments, and anti-interference capability is a key factor in whether the industrial control systems we develop can operate normally and deliver the expected economic benefits. Therefore, in addition to carefully designing hardware anti-interference measures for the overall system structure and for each industrial control computer, software anti-interference measures also deserve attention.

In our years of industrial control research, we have come to appreciate how many unexpected factors exist in industrial settings and how much harm they do. A single accidental interference event, human or otherwise, such as a relatively mild lightning strike, can defeat hardware anti-interference measures that seemed impenetrable. The industrial control computer then either crashes (the program runs away) or loses control (the contents of the CPU's internal registers, RAM, or I/O ports are modified), which can cause major accidents in critical industrial processes. Software anti-interference measures can, to a certain extent, avoid these accidents or mitigate their consequences. Software anti-interference technology uses the software's self-monitoring during operation, together with mutual monitoring between machines in the industrial control network, to determine whether an industrial control computer is malfunctioning or has failed. It is the last line of defense against interference in the industrial control system.

1. Structural Characteristics and Interference Pathways of Industrial Control Software

Although the functions performed by industrial control software differ from one system to another, its structure generally has the following characteristics:

* Real-time: Some events in industrial control systems are random, so the software must handle them promptly.
* Periodicity: After system initialization, the software enters the main program loop. If an interrupt request arrives during the loop, the corresponding interrupt service routine runs and the main loop then continues.
* Dependency: The software consists of multiple task modules that cooperate, are interconnected, and depend on one another.
* Human intervention: The software allows operators to intervene in the system's operation and adjust its operating parameters.

Under ideal conditions, industrial control software executes normally. Under interference from the industrial environment, however, its periodicity, dependency, and real-time behavior are disrupted, the program malfunctions, and the industrial control system loses control. This manifests as follows:

* Changes in the program counter (PC) value. After interference the PC value becomes random and program execution becomes chaotic. Guided by the wrong PC value, the CPU executes a series of meaningless instructions and often ends up in a meaningless "infinite loop," so the system loses control.
* Interference with input/output interface status. This disrupts the dependency and periodicity of the software and lets a single task module monopolize system resources, resulting in system "deadlock."
* Increased data acquisition errors.
  Interference intrudes into the system's forward channel and is superimposed on the signal, increasing data acquisition errors. The problem is particularly severe when the sensor interface of the forward channel receives a low-voltage signal.
* Changes in the RAM data area. Depending on the path of the interference and the nature of the affected data, the damage varies: some interference causes numerical errors, some causes control failures, some alters program states, and some changes the operating state of components such as timers/counters and serial ports.
* Control state failure. In industrial control systems, the control output is often determined by the input of certain condition states and the result of logic processing on them. Interference at these stages can corrupt the condition states, increasing output control errors or even causing control malfunctions.

2. Self-Monitoring Methods in the Real-Time Control Software Operation of Industrial Control Computers

In self-monitoring, the industrial control computer monitors its own operating state. A typical industrial control computer CPU has a watchdog timer, which uses a timer interrupt to monitor the program's running state. The timing interval is set slightly longer than the time the main program needs to complete one cycle, and the main program refreshes the timer's time constant once in every cycle. As long as the program runs normally, the timer interrupt never occurs. If the program runs away and fails to refresh the time constant in time, the timer interrupt occurs and its service routine resets the system.
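The refresh-or-reset behavior described above can be sketched with a small host-side model. This is only an illustration in Python, not firmware: the class `WatchdogTimer`, the function `run_cycles`, and the 100 ms and 80 ms figures are assumptions chosen for the example, and a real watchdog counts down in hardware while the program runs.

```python
class WatchdogTimer:
    """Host-side model of a countdown watchdog (all names are illustrative)."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms    # set slightly longer than one main-loop cycle
        self.remaining_ms = timeout_ms
        self.resets = 0                 # number of times the watchdog has fired

    def refresh(self):
        """Reload the time constant; the main loop calls this once per cycle."""
        self.remaining_ms = self.timeout_ms

    def tick(self, elapsed_ms):
        """Advance time. On expiry, record a system reset and reload the timer."""
        self.remaining_ms -= elapsed_ms
        if self.remaining_ms <= 0:
            self.resets += 1
            self.remaining_ms = self.timeout_ms
            return True
        return False


def run_cycles(watchdog, cycle_times_ms):
    """Simulate main-loop cycles of the given durations; return the reset count."""
    for cycle_ms in cycle_times_ms:
        if not watchdog.tick(cycle_ms):
            # The cycle finished in time, so the main loop refreshes the timer.
            watchdog.refresh()
    return watchdog.resets
```

With a 100 ms timeout, `run_cycles(WatchdogTimer(100), [80, 80, 80])` returns 0, while a cycle that hangs for 250 ms, as in `[80, 250, 80]`, lets the timer expire once and returns 1.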
Taking an 8031 application system as an example of software anti-interference, the specific approach is as follows:

* Use the "overflow" signal generated by the 8155 timer as the external interrupt source INT1 of the 8031, and use a 555 timer as the external clock input of the 8155 timer;
* Set the timing value of the 8155 timer slightly larger than the normal loop time of the main program;
* In the main program, refresh the time constant of the 8155 timer on every pass through the loop;
* At the beginning of the main control program, determine whether the restart was caused by a hardware reset or by the timer interrupt, and handle each case accordingly.

However, this is not foolproof. For example: ① the watchdog circuit itself may fail; ② the instruction that refreshes the watchdog may be misread because of interference during instruction fetch; ③ after the watchdog "detects" a program crash, its reset pulse or NMI request signal may be rendered ineffective by interference. Although the probability that these factors defeat the watchdog is small, it is never zero. Moreover, a considerable number of industrial control computers have no watchdog circuit at all. The software self-monitoring methods discussed below are therefore necessary.

2.1 Continuously Check Whether the Program Counter (PC) Value Exceeds the Program Area

For a computer to run normally, its PC value must lie within the program area. If the PC value is outside the program area, the program has definitely crashed. To check the PC value, choose an interrupt service routine whose external interrupt occurs frequently and, inside it, read the breakpoint address that was pushed onto the stack when the interrupt was taken. If this address lies within the program area, the PC value is considered normal; otherwise the program has definitely crashed. In that case, the routine jumps to the machine's restart or reset entry point, and the machine restarts.
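The stack check described above can be modelled as follows. This is a minimal Python sketch rather than 8031 code: the address bounds, the restart entry value, and the names `pc_is_valid` and `interrupt_check` are hypothetical, and the list `stack` merely stands in for the hardware stack whose top holds the breakpoint address.

```python
PROGRAM_START = 0x0000   # assumed lower bound of the program area
PROGRAM_END = 0x1FFF     # assumed upper bound of the program area
RESTART_ENTRY = 0x0030   # assumed restart entry point


def pc_is_valid(return_address):
    """True if the interrupted PC value lies inside the program area."""
    return PROGRAM_START <= return_address <= PROGRAM_END


def interrupt_check(stack):
    """Model of the check made inside a frequently executed interrupt routine.

    stack[-1] is the breakpoint address pushed when the interrupt was taken.
    Returns the address at which execution continues.
    """
    return_address = stack[-1]
    if pc_is_valid(return_address):
        return return_address        # normal case: resume the interrupted code
    stack.clear()                    # runaway: discard the corrupted context
    return RESTART_ENTRY             # and continue at the restart entry
```

A return address such as 0x0123 is accepted and execution resumes there, while 0x8000 lies outside the assumed program area, so the handler transfers control to the restart entry point.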
Thus, the machine recovers.

If no suitable interrupt source is available, one or several dedicated timer interrupts can be set up. Their interrupt service routines check the PC value for validity and, if an error is found, immediately jump to the machine's restart entry point. The time constant of the timer interrupt can be chosen according to the machine's workload and importance, and generally ranges from a few milliseconds to tens of milliseconds.

The limitation of this method is that it cannot detect a runaway PC that stays within the program area. In that case the disturbed PC value does not leave the program area but lands on the wrong byte boundary, so that operands are decoded as opcodes and arbitrary instruction sequences are executed, producing inexplicable operations or infinite loops.

2.2 Mutual Monitoring Between the Main Loop Program and Interrupt Service Routines

The main loop program and the interrupt service routines of every industrial control computer follow predictable operating patterns. Mutual monitoring can therefore be designed between the main loop program and each interrupt service routine, and among the interrupt service routines themselves. Each monitoring pair is assigned a RAM unit, and the monitoring information is expressed by incrementing and clearing this unit.

For example, suppose the main loop program of an industrial control computer has a maximum loop time of 80 ms and one of its timer interrupts has a time constant of 10 ms. To have this timer interrupt monitor the main loop program, the interrupt increments the RAM unit by 1 every 10 ms, and the main loop program clears the unit to zero once per loop cycle. During normal operation, the count in this RAM unit can never reach 9. If the 10 ms timer interrupt service routine finds a count ≥ 9, it knows that the main loop program has been disturbed and has run away or entered an infinite loop.
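The 80 ms / 10 ms example can be traced with a short simulation. It is a sketch in Python under simplifying assumptions: the names `LoopMonitor` and `simulate` are illustrative, ticks are assumed to fall on exact 10 ms boundaries, and a boolean flag stands in for the jump to the restart entry.

```python
MAX_LOOP_MS = 80                       # maximum main-loop time from the example
TICK_MS = 10                           # timer interrupt period from the example
LIMIT = MAX_LOOP_MS // TICK_MS + 1     # a count of 9 or more means runaway


class LoopMonitor:
    """One monitoring pair: the timer interrupt watches the main loop."""

    def __init__(self):
        self.count = 0                 # the shared RAM unit
        self.restart_requested = False

    def timer_isr(self):
        """Runs every 10 ms: increment the unit and check the limit."""
        self.count += 1
        if self.count >= LIMIT:
            self.restart_requested = True

    def main_loop_pass(self):
        """Runs once per main-loop cycle: clear the unit."""
        self.count = 0


def simulate(loop_times_ms):
    """Return True if any main-loop pass runs long enough to be flagged."""
    monitor = LoopMonitor()
    for loop_ms in loop_times_ms:
        for _ in range(loop_ms // TICK_MS):
            monitor.timer_isr()
            if monitor.restart_requested:
                return True
        monitor.main_loop_pass()
    return False
```

`simulate([80, 80, 60])` returns False because the count never exceeds 8, whereas `simulate([80, 200])` returns True: during the 200 ms pass the count reaches 9 and the timer interrupt service routine flags the runaway.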
On detecting this, the timer interrupt service routine jumps to the machine's restart entry point to resume operation. Properly designed, this method is very effective. Our years of experience show that the main loop program is the most likely to be disturbed and run away, and that the shorter an interrupt service routine is, the less likely it is to run away. It is best to design several monitoring pairs, both between the main loop program and the interrupt service routines and among the interrupt service routines themselves.

2.3 Continuously Verify the Correctness of Program Code

The real-time control program code of an industrial control computer is usually stored in EPROM and is not easily altered. After years of operation, however, we have occasionally found individual memory cells with errors. The cause may be chip quality problems or alteration by static electricity, lightning strikes, and the like. Errors in the program code directly cause runtime errors or prevent the program from running.

Verification can use a cumulative checksum or a BCH check (a cyclic-code check of the same family as CRC). With BCH verification, the redundant bytes added to each group can be gathered in an EPROM area outside the program area. The verification module is placed in a short, frequently executed interrupt service routine. It can verify a portion of the program code on each invocation and complete the whole verification over several invocations, or, when the code is small and the workload light, verify everything at once. If a verification error is found, it should immediately be reported to the master station of the industrial control network, or the operator should be alerted through the machine's own alarm mechanism, so that it can be handled promptly.

This method has two limitations: it works only if the corrupted program code is not the verification module itself, and it relies on the interrupt still being serviced normally.
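Verifying a portion of the code on each invocation might look like the following. This Python sketch uses the cumulative checksum (not BCH); the class `RomVerifier`, the 16-bit truncation, and the chunk size are assumptions for illustration, and a `bytes` object stands in for the EPROM contents.

```python
def checksum(data):
    """Cumulative (additive) checksum, truncated to 16 bits."""
    return sum(data) & 0xFFFF


class RomVerifier:
    """Verifies a code image a chunk at a time across many invocations."""

    def __init__(self, rom, expected, chunk_size):
        self.rom = rom                # stand-in for the EPROM program area
        self.expected = expected      # reference checksum stored elsewhere
        self.chunk_size = chunk_size  # bytes summed per invocation
        self.offset = 0
        self.running = 0

    def step(self):
        """Sum one chunk.

        Returns None while a pass is in progress, then True (image intact)
        or False (image corrupted) once the whole image has been summed.
        """
        chunk = self.rom[self.offset:self.offset + self.chunk_size]
        self.running = (self.running + sum(chunk)) & 0xFFFF
        self.offset += self.chunk_size
        if self.offset < len(self.rom):
            return None
        ok = self.running == self.expected
        self.offset = 0               # start the next pass from the beginning
        self.running = 0
        return ok
```

Each call to `step` would correspond to one execution of the interrupt service routine; it returns None until the last chunk has been summed, then True or False for the whole pass.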
Because interrupt service routines are short, they are seldom affected, so the probability that the program code can monitor its own correctness is usually high.

2.4 Continuously Verify RAM Correctness

In real-time control, one of the more serious hazards of interference is the destruction of data in RAM. RAM holds raw data, flags, variables, and so on; if they are damaged, the system makes errors or cannot run. By extent of destruction, three cases can generally be distinguished:

* All data in RAM is destroyed;
* Part of the data in RAM is destroyed;
* Individual data items are destroyed.

It is therefore necessary to check the correctness of RAM frequently. In industrial control systems, most RAM content is temporary data held for analysis and comparison, and only a very small portion must never be lost. Apart from this critical data, most of the remaining content can be corrupted briefly, causing at most a very short fluctuation from which the system quickly recovers automatically. Industrial control software therefore only needs to protect the small amount of critical data. The commonly used methods are the "verification method" and the "marking method," each with advantages and disadvantages. The verification method is more complicated, but its error detection is highly reliable. The marking method is simple, but it has difficulty detecting errors when individual items in a data table are overwritten. In programming, the two should generally be used in combination, as follows:

* Set a flag code "0" or "1" at the beginning and end of each important area of the RAM working area;
* Set a check word for each fixed data table in RAM.

During program execution, a pre-designed error-checking routine runs at regular intervals to check whether each flag code is normal.
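Combining the two methods might look like the sketch below. It is written in Python for illustration only: the marker values 0x5A and 0xA5 are assumed in place of the "0"/"1" flag codes mentioned above, the check word is assumed to be an 8-bit sum, and the layout and the names `build_area` and `area_is_normal` are hypothetical.

```python
HEAD_FLAG = 0x5A   # assumed flag code at the start of the protected area
TAIL_FLAG = 0xA5   # assumed flag code at the end of the protected area


def check_word(table):
    """8-bit additive check word over the data table (an assumed scheme)."""
    return sum(table) & 0xFF


def build_area(table):
    """Lay out the area as [head flag][table bytes][check word][tail flag]."""
    return bytearray([HEAD_FLAG, *table, check_word(table), TAIL_FLAG])


def area_is_normal(area):
    """Marking method (both flags) combined with verification method (check word)."""
    table = area[1:-2]
    return (area[0] == HEAD_FLAG
            and area[-1] == TAIL_FLAG
            and area[-2] == check_word(table))
```

The flags catch an overwrite that sweeps across the whole area, while the check word catches a single altered item that leaves both flags intact. Called at regular intervals, `area_is_normal` reports whether the protected area is normal.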
If a flag code is abnormal, an anti-interference handling routine uses data redundancy to correct the data. The general design principles for redundant data tables are:

* Place the redundant tables far apart from one another, to reduce the probability that all copies are overwritten at the same time.
* Place the tables as far from the stack area as possible, to reduce the chance that erroneous operations overwrite them.

Which of these RAM recovery methods to use should be decided according to the specific circumstances of each application system.

3. Mutual Monitoring Method for Real-Time Control Systems

In real-time control systems, the concern is whether normal control states can be guaranteed. Interference that enters the system affects the control conditions and causes control output errors. To ensure system safety, the following software anti-interference measures can be taken.

3.1 Software Redundancy

For condition-controlled systems, replace sampling and processing the control output once with sampling and processing it cyclically. This is effective against occasional, accidental interference.

3.2 Setting a Current Output Status Register Unit

When interference intrudes into the output channel and corrupts the output state, the system can consult the output status information held in the current output status register unit and correct the erroneous output state in time.

3.3 Setting a Self-Test Program

Status flags are set in specific parts or memory units of the computer system and are tested continuously in a loop during operation, to ensure highly reliable storage, transmission, and computation of information within the system.

4. Other Commonly Used Software Anti-Interference Methods

4.1 Trap Method

Sometimes unexpected interference disrupts the normal operation of the interrupts and of all programs. In this case, the PC value may be inside or outside the program area.
For the machine to recover by itself and resume normal operation in this case, the only solution is to deploy "traps" widely. Here, "traps" are software interrupt instructions or reset instructions that certain CPUs provide to the user.

One example is the Z80 instruction RST 38H, whose machine code is FFH. When the CPU executes this instruction, it pushes the current program counter (PC) value onto the stack and jumps to address 0038H. If 0038H is used as the restart entry point, the machine can resume normal operation. Another example is the reset instruction RST of the Intel 8098 and 80198 series, whose machine code is also FFH. When the CPU executes it, it performs an internal reset and then starts executing the program from 2080H. The 80198 series additionally has many illegal opcodes that can serve as trap instructions. In that case it is only necessary to place the entry address of an interrupt service routine for illegal opcodes in the one-word interrupt vector at 2012H; once an illegal opcode is encountered, fault handling can be performed.

The author's many years of experience show that traps should be set not only in all unused areas of ROM and all non-data areas of RAM, but also widely distributed between the modules in the program area. Once the program runs out of control, it will sooner or later hit a trap, and the machine is rescued immediately.

4.2 Repeated Function Setting Method

Many functions of an industrial control computer are configured only once, in the initialization code at the start of the main program, and never set again. Under normal circumstances this is not a problem. However, accidental interference can alter these registers inside the CPU or the function registers of interface chips. A change to the interrupt type, interrupt priority, serial port, or parallel port settings, for example, will certainly cause the machine to malfunction.
Therefore, any setting operation that can be repeated without affecting ongoing operation should be included in the main program loop, so that every pass refreshes the settings and guards against unforeseen accidents. For setting operations whose repetition would disturb ongoing operation, every opportunity should be sought to re-apply them. With a serial port, for example, if there is a brief idle period after a frame has been received or sent, the program should detect it and re-apply the settings then.

4.3 Important Data Backup Method

Critical data in an industrial control computer should have at least two backup copies. Whenever this data is operated on, the primary copy and the backups can be compared. If they differ, the cause should be analyzed and a pre-designed handling method applied. Important data can also be verified using a checksum or a grouped BCH check. Using these two methods together is more reliable.

4.4 Software Countermeasures for System Deadlock

In industrial control systems, input/output interface circuits such as A/D converters, D/A converters, and displays are indispensable. These interfaces work with the CPU by polling or by interrupts, and they are highly sensitive to interference. Once an interference signal corrupts the status word of an interface, the CPU may mistake it for an input/output request, stop its current work, and switch to the corresponding input/output service routine. Because the interface has no actual data to transfer, this service routine occupies the CPU for a long time without releasing it, other tasks cannot execute, and the system deadlocks.

To address interference-induced deadlock in software, a "time-slicing" method, that is, a time limit on each input/output operation, can be used.
The specific steps are:

* Allocate a maximum normal input/output time to each input/output peripheral according to its timing requirements.
* Add a corresponding timeout check to each input/output task module.

When interference corrupts the interface state and misleads the CPU, the peripheral's ready signal stays invalid for a long time. Once the allotted time has elapsed, the system automatically returns from the peripheral's service routine, so the periodicity of the software as a whole is preserved and "deadlock" is prevented.

4.5 Software Countermeasures for Data Acquisition Errors

The software countermeasures depend on the nature and consequences of the interference, and there is no fixed pattern. In real-time data acquisition systems, interference in sensor channels is often removed in hardware, using analog filters built from active or passive RLC networks to perform frequency filtering. The CPU's computation and control capabilities can achieve the same frequency filtering and take over the role of the analog filter; this is digital filtering. Many monographs on digital signal processing discuss it in detail and can be consulted. As computer processing speeds increase, digital filtering will be used ever more widely in real-time data acquisition systems. In ordinary data acquisition systems, simple numerical and logical operations are enough to achieve a filtering effect. Methods include the arithmetic mean method, the comparison and rounding method, the median method, and the first-order recursive digital filtering method. See the related articles in this section for details.

5. Conclusion

Software anti-interference has many other aspects, such as digital filtering of measured quantities, rejection of bad values, and validity checks on manual control commands and input setpoints. All of these are essential for a complete industrial control system.
Related articles can be found elsewhere in this section.

Industrial control is one of the most important application areas of computers, and also their most challenging application environment. Based on my years of research experience, I believe that the anti-interference performance of an industrial control computer rests fundamentally on its hardware structure; software anti-interference is only a supplement. Hardware design should be as complete as possible, and its standards should not be casually lowered in the expectation that software will compensate. Software development, for its part, must consider all possible hardware failures and interference and strive to improve the system's anti-interference performance while guaranteeing real-time performance, control accuracy, and control functionality. This requires meticulous design and software with a high degree of intelligence. Only in this way can the software approach perfection. Only by organically combining hardware and software can a complete industrial control system be built that withstands long-term testing in the field.