
Reliability of Distributed Control Systems (DCS)

2026-04-06 02:05:59
Keywords: reliability, system, failure, control system, measures, decentralized, improvement, software, problems

The main function of a distributed control system (DCS) is to control, monitor, manage, and make decisions about the production process. It must therefore be highly reliable to ensure the safe and economical operation of the plant, and to achieve this, distributed control systems adopt many reliability-improving measures. This article discusses the general concept of reliability, reliability analysis methods, the reliability measures adopted in distributed control systems, and software reliability issues.

With the continuous development of large-scale computer systems and international computer communication networks, reliability has become a very important issue, and reliability theory continues to develop and mature under this pressure. The research content of reliability technology can be roughly divided into four aspects: reliability design, reliability analysis, reliability testing, and reliability management. Reliability design aims to design and manufacture highly reliable, durable products to specific technical requirements. Reliability analysis collects, analyzes, and calculates relevant data to derive evaluations and conclusions about reliability problems. Reliability testing verifies whether the system's reliability meets specified targets and can expose latent defects in the system design. Reliability management improves overall system reliability from a management perspective, for example by establishing reasonable maintenance cycles, stocking appropriate spare parts, and assigning sufficient maintenance personnel.

I. Reliability Measures in Distributed Control Systems

Distributed control systems employ numerous technical measures to improve reliability.
These measures rest on four fundamental principles: first, make the system itself less prone to failure (fault prevention); second, minimize the impact of system failures (fault protection and mitigation); third, allow the system to continue operating when a failure occurs (fault tolerance); and fourth, allow maintenance without stopping system operation (online maintenance). Based on these four principles, distributed control systems employ the following reliability measures.

1) Strict quality management and improved system hardware

Hardware is the material basis for the normal operation of a system and a key factor in its reliability, so raising the Mean Time Between Failures (MTBF) is an important way to improve system reliability. To this end, DCS manufacturers take many measures:

1. Strict screening and burn-in of components. Screening eliminates, by appropriate methods, components that do not meet the usage conditions. Burn-in operates components under controlled conditions before use, so that components prone to parameter drift stabilize gradually.

2. Derating of components. Electronic components have specified usage conditions, expressed as rated parameter values. Practice shows that when components operate below their rated values, their operation is more stable and their chance of failure lower, so components are often derated to improve reliability. The extent of derating must balance reliability against cost, because the higher a component's ratings, the higher its price.

3. Full allowance for parameter changes. The circuit design fully accounts for the effect of component parameter drift during use, ensuring normal operation under various adverse conditions.

4. Use of low-power components. Low-power components generate less heat and have a relatively low failure rate. Their widespread use also significantly reduces the load on the power supply and improves power-supply reliability.

5. Noise suppression. In industrial control environments, interference pulses of various kinds are a frequent cause of hardware failures, so noise suppression technology is an effective way to improve system reliability.

6. Environmentally resistant design. The hardware design fully considers environmental factors, employing appropriate cooling, shock absorption, dustproofing, and corrosion protection to improve the system's ability to withstand its surroundings.

2) Ensuring a safe state during system failures

1. Limit the scope of the failure. The system performs continuous online fault detection during operation; once a fault is detected, the faulty device is isolated from the system so that it cannot affect the normal operation of other equipment.

2. "Freeze" the CPU output. If the system detects a CPU fault, it immediately freezes the control system's output information to avoid output chaos.

3) Backup measures

1. Manual backup. For important control loops, manual backup can improve reliability: once automatic control fails, the production process can be controlled by hand. A distributed control system offers three levels of manual operation:

(1) Manual operation from the operator station. This mode requires the operator station, communication network, basic control unit, and process output channel all to be working normally, so it has definite limitations.

(2) Operation through the I/O module using a manual station. This mode involves fewer links, so it is more reliable.
However, it still requires the I/O module to be working normally; otherwise manual operation is impossible.

(3) Direct operation from a manual station. Here the manual station outputs a 4-20 mA or 1-5 V analog signal directly to the actuator, so manual operation remains possible even if the I/O module fails. This mode is also common in power plants.

2. Automatic backup. Automatic backup provides one or more redundant backup control devices. When an operating automatic control device fails, a backup device activates automatically to maintain the system's automatic control. Automatic backup is a form of redundancy and is discussed in more detail later.

II. Software Reliability

The discussion above concerns hardware reliability; below we briefly introduce the general concept of software reliability. Research on software reliability started relatively late but has attracted growing attention in recent years, mainly because low software reliability not only degrades system operation but can lead to system paralysis and irreversible accidents. For example, in 1963 a hidden software error caused a rocket launched toward Mars to explode, at a loss of ten million dollars. Research on software reliability is still relatively immature, but a grasp of the basic concepts helps in understanding software reliability issues.

1. Software reliability

Initially, software reliability was equated simply with correctness: if software accurately performed its required functions, it was considered reliable. Yet even this most basic requirement is often not met. Statistics show that newly written software contains, on average, one error per 100-4000 instructions. These errors must be discovered and corrected during debugging, integration testing, trial operation, and even operation.
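That error-density statistic can be turned into a rough range of latent errors expected in a newly written program. The sketch below is illustrative only; the program size is an assumed example value, not from the original:

```python
# Rough latent-error estimate from the quoted density of
# one error per 100-4000 newly written instructions.
def latent_error_range(instructions: int) -> tuple[int, int]:
    """Return (best_case, worst_case) counts of expected latent errors."""
    best = instructions // 4000   # sparsest quoted density: 1 error per 4000
    worst = instructions // 100   # densest quoted density: 1 error per 100
    return best, worst

# Example: a 50,000-instruction control program (assumed size).
lo, hi = latent_error_range(50_000)
print(f"Expect roughly {lo} to {hi} latent errors")  # roughly 12 to 500
```

Even at the optimistic end of the range, the estimate makes clear why debugging, integration testing, and trial operation are all needed before such errors stop surfacing.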
In recent years, software reliability has been given a broader meaning that also includes ease of use and ease of expansion: software that is hard to use or hard to extend is considered defective. Software quality is determined mainly by the following six factors:

(1) Time factors. Like hardware, software has indicators such as MTBF and MTTR, plus two of its own: Mean Time Between System Downs (MTBD) and Mean Down Time (MDT).

(2) Defect frequency, which covers the number of software defects, the number of document defects, and the number of supplementary requests made by users.

(3) Percentages related to software reliability. Besides percentages similar to hardware's (reliability, availability, maintainability, failure rate), there are the following:

Non-conforming rate: events that cannot be considered failures but should be improved are non-conforming events; their rate of occurrence is the non-conforming rate.

Delay rate: if a task must be completed within a specified time T but, owing to software unreliability, actually completes at time T1, then D = T1 - T is the delay time and D/T is the delay rate.

Misoperation rate: this depends on operator skill, but to some extent it reflects whether the software manual is clear and whether the software is convenient to operate.

Unknown-cause rate: the rate of software failures that cannot be corrected because no cause can be found; it reflects the maintainability of the software.

Same-failure event rate: the recurrence rate of failures that reappear after corrective measures have been taken; it reflects the incompleteness of those measures.

Reliability cost rate: the ratio of the cost of software reliability and maintenance to the total software cost.
(4) Software investment, including the man-days or man-hours consumed in developing the software, the number of items checked in the software, and the cost of countermeasures taken in response to user requests.

(5) Software characteristics, including whether the software is newly developed or largely reused, its complexity, standardization, structure and size, and its life cycle.

(6) Usage characteristics of the software (e.g., online system, real-time system), computer characteristics, system architecture, quality standards, and so on.

2. Software quality standards

Here we introduce only the following commonly used software quality indicators:

(1) Mean time between system downs (MTBD). Let Tv be the total normal working time of the software and d the number of times the system stops working because of software failure. Then MTBD = Tv / (d + 1).

(2) Number of system stops (within a given time): the number of times the system stops because of software failure and must be restarted by the operator to continue working.

(3) Mean time to repair (MTTR). This reflects the efficiency of countermeasures after software defects appear. For online systems, MTTR < 2 days is generally required; for ordinary systems, MTTR < 7 days.

(4) Mean down time (MDT): the average downtime of the system due to software failure. For online systems, MDT < 10 min is required; for ordinary systems, MDT < 30 min.

(5) Availability (A). Let Tv be the normal working time of the software and TD the time the system is down because of software failure. Then A = Tv / (Tv + TD), which can equivalently be written A = MTBD / (MTBD + MDT).

(6) Initial failure rate. The initial failure period is generally the first three months after the software is delivered to the user, and the initial failure rate is measured in failures per 100 hours.
It is used to evaluate the quality of the software at delivery and to predict when software reliability will essentially stabilize. Generally, the initial failure rate should be no more than 1, i.e., fewer than one failure per 100 hours.

(7) Random failure rate. The random failure period generally begins four months after delivery, and the random failure rate is measured in failures per 1000 hours. It reflects the quality of the software in its stable state. Generally it should be no more than 1, i.e., fewer than one failure per 1000 hours, which means an MTBF exceeding 1000 hours.

(8) User misuse rate. Errors caused by the user not operating the software in accordance with the manual and other documents are called user misuses; the percentage of user misuses among total uses is the user misuse rate.

(9) Number of supplementary requests from users. Supplementary requests indicate that the software's functions cannot fully meet the user's needs. If they occur frequently after the software has been in use for some time, the software has entered its aging period and new software should be developed to replace it. Generally, during the random failure period, user requests should average no more than one per month; if they exceed one per month, the software is considered to have entered the aging period.

(10) Processing capacity. Various indicators express processing capacity, for example how many process input variables are handled per second, or how many seconds it takes to switch each CRT display.

The above introduces some basic concepts of the reliability of distributed control systems. These concepts are very important for the correct understanding and reasonable application of such systems.
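To make the indicator definitions concrete, the formulas for MTBD, availability, and delay rate can be sketched directly from their definitions. This is a minimal illustration; the language choice and the sample values in the example are assumptions, not from the original:

```python
def mtbd(total_uptime_h: float, downs: int) -> float:
    """Mean time between system downs: MTBD = Tv / (d + 1)."""
    return total_uptime_h / (downs + 1)

def availability(uptime_h: float, downtime_h: float) -> float:
    """A = Tv / (Tv + TD); equivalently A = MTBD / (MTBD + MDT)."""
    return uptime_h / (uptime_h + downtime_h)

def delay_rate(required_h: float, actual_h: float) -> float:
    """Delay time D = T1 - T; delay rate = D / T."""
    return (actual_h - required_h) / required_h

# Assumed sample data: 2000 h of uptime, 3 software-caused downs,
# 2 h of total downtime, and a 10 h task that took 12 h.
print(mtbd(2000, 3))          # 500.0
print(availability(2000, 2))  # ~0.999
print(delay_rate(10, 12))     # 0.2
```

Note how the two availability formulas agree: with Tv = 2000 h and TD = 2 h, MTBD = 500 h and MDT = 0.5 h give the same ratio.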
We should therefore pay close attention to DCS reliability, and it is hoped that everyone can reach a correct understanding of the safety and reliability of DCS.