The special role of intelligent monitoring and management in improving UPS system availability

Among the metrics used to assess the safety performance of a UPS system, two are particularly important: reliability and availability. Because the UPS is a key component for improving power quality, its own reliability and availability are the most fundamental indicators of its performance. This paper analyzes the factors that affect UPS availability in detail and, from that analysis, derives effective ways to improve availability through advanced intelligent UPS management technologies. New UPS management technologies and products are of great significance for improving the availability of UPS systems.

As the definition of system availability shows, there are two ways to improve UPS availability: improve system reliability, that is, extend the Mean Time Between Failures (MTBF), or reduce the Mean Time To Repair (MTTR). The relationship between MTTR and availability indicates that, of the two, shortening MTTR has the more significant effect.
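Assuming the definition meant here is the usual steady-state one, A = MTBF / (MTBF + MTTR), a minimal Python sketch shows how the two quantities enter the result. The function names and the figures in the usage lines are illustrative assumptions, not values from this paper.

```python
HOURS_PER_YEAR = 8760.0


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability, assuming A = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected downtime per year implied by the availability above."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Illustrative figures only: same MTBF, progressively shorter MTTR.
for mttr in (48.0, 24.0, 4.0):
    a = availability(100_000.0, mttr)
    d = annual_downtime_hours(100_000.0, mttr)
    print(f"MTTR={mttr:5.1f} h  availability={a:.6f}  downtime={d:.2f} h/yr")
```

Under this formula, halving MTTR reduces unavailability by exactly the same amount as doubling MTBF.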

This section uses a specific case to analyze the composition of the Mean Time To Repair (MTTR) in detail. The case is an 80 kVA UPS system. When a UPS of this size fails, repair typically requires professional technicians from the manufacturer. Many manufacturers offer service commitments such as "4-hour response" and "24-hour repair" for such systems. These times, however, are not the actual fault recovery time. First, the "4-hour response" usually covers only the time from when the manufacturer's engineer is notified by the user to when an on-site repair plan is drawn up, which is still a long way from the fault actually being repaired. Second, "24-hour repair" comes with many additional conditions, such as the availability of engineers and spare parts near the faulty equipment. In practice, the actual repair time depends on every step of the fault repair process.

Breaking the repair of such a UPS failure down into its actual stages shows that the repair time for a single failure consists of the following periods:

Fault alarm notification time (T1): the time from the occurrence of the fault to its discovery by the user.

Manufacturer response time (T2): the time from when the user reports the fault to the manufacturer's after-sales service department to when the manufacturer's after-sales service engineer contacts the user and draws up an on-site repair plan.

Initial fault diagnosis time (T3): the time the manufacturer's after-sales service engineer spends communicating with the user, by telephone or other means, to understand the fault symptoms and the course of events and to make a basic fault diagnosis.

On-site service time (T4): the time from when the engineer completes this basic diagnosis to when the engineer arrives on site to provide service.

Troubleshooting time (T5): the time from when the engineer arrives on site to when the fault is resolved.
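The breakdown above can be sketched in Python as a simple sum of the five periods for one failure, with MTTR taken as the mean over several failures. The class name, field names, and all durations below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, fields
from typing import Sequence


@dataclass(frozen=True)
class RepairTimeline:
    """Durations in hours for one failure, following the T1..T5 breakdown."""
    t1_alarm_notification: float
    t2_manufacturer_response: float
    t3_initial_diagnosis: float
    t4_onsite_service: float
    t5_troubleshooting: float

    def total_hours(self) -> float:
        """Repair time for this failure: the sum of the five periods."""
        return sum(getattr(self, f.name) for f in fields(self))

    def largest_stage(self) -> str:
        """Name of the period that contributes the most time."""
        return max(fields(self), key=lambda f: getattr(self, f.name)).name


def mean_time_to_repair(timelines: Sequence[RepairTimeline]) -> float:
    """MTTR as the average repair time over the recorded failures."""
    if not timelines:
        raise ValueError("at least one failure record is required")
    return sum(t.total_hours() for t in timelines) / len(timelines)


# Hypothetical failure record, in hours.
example = RepairTimeline(2.0, 4.0, 1.0, 3.0, 6.0)
print(example.total_hours(), example.largest_stage())
```

Keeping the periods separate, rather than recording only a total, makes it visible which stage dominates a given repair.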

1. First, let's analyze the first time period – the fault alarm notification time T1.

This period may seem short, but it is in fact highly uncertain. First, medium- and large-capacity UPS systems are typically installed in dedicated power rooms, which are usually unattended because of noise and safety concerns, so UPS failures often go unnoticed until serious consequences have already occurred. Second, a UPS is high-voltage equipment whose routine maintenance requires specially trained personnel. After a failure, qualified personnel must therefore assess and diagnose it on site before any appropriate action can be taken, which further delays fault notification. These factors, together with the uncertainties of distance and specialized knowledge, make the fault notification time T1 highly unpredictable and potentially a significant factor in reduced system availability.

Here is a real-world example. A bank's data center in Tianjin was powered by a 125 kVA UPS. The UPS was installed on the second basement level and was normally unattended. At 10:00 one morning, the UPS output was suddenly interrupted for 10 seconds, paralyzing the entire data center. On-site inspection by engineers found no hardware fault in the UPS; it had simply been operating in bypass mode at the time. The UPS operating history showed that the mains had dropped out for a brief 10 seconds. Because the UPS was in bypass mode, the mains was in effect supplying the load directly, so the mains outage hit the load directly. Further investigation revealed that the UPS had already been in bypass mode for two days: the start-up of a large-capacity load had overloaded it, and it had locked into bypass mode (a behavior determined by the UPS operating mode setting). The UPS had sounded an audible alarm at that time, but because of the distance the staff did not hear it, and the problem was discovered only after serious consequences had occurred. This case shows that the fault notification time T1, often regarded as unimportant, can be as long as two days. Because of this large uncertainty, T1 has a substantial impact on MTTR and may be a major reason for reduced UPS system availability.

2. Let's look at the second time period – the manufacturer's response time T2.

Because repairing medium- and large-capacity UPS systems requires specialized knowledge and skills, it is usually carried out by the manufacturer's technicians. The length of the response time reflects the attention the manufacturer pays to after-sales service and its capability to deliver it. Depending on the product, manufacturers provide either 5x8 (5 days a week, 8 hours a day, during statutory working hours) or 7x24 (7 days a week, 24 hours a day) after-sales service response.
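To see how the coverage plan alone can stretch T2, the following Python sketch computes how long a fault report waits before the service desk is staffed. It assumes a 09:00 to 17:00, Monday to Friday working day for the 5x8 plan and ignores public holidays; these hours and the function names are assumptions made for illustration, not terms from any manufacturer's contract.

```python
from datetime import datetime, time, timedelta

WORK_START = time(9, 0)   # assumed start of the working day
WORK_END = time(17, 0)    # assumed end of the working day (8 hours later)


def next_staffed_moment(reported: datetime, coverage: str) -> datetime:
    """Earliest moment at or after `reported` when the service desk is staffed."""
    if coverage == "7x24":
        return reported
    if coverage != "5x8":
        raise ValueError(f"unknown coverage plan: {coverage!r}")
    moment = reported
    while True:
        is_weekday = moment.weekday() < 5  # Monday=0 .. Friday=4
        if is_weekday and moment.time() < WORK_START:
            return datetime.combine(moment.date(), WORK_START)
        if is_weekday and moment.time() < WORK_END:
            return moment
        # After hours or weekend: move to the start of the next calendar day.
        moment = datetime.combine(moment.date() + timedelta(days=1), time(0, 0))


def waiting_hours(reported: datetime, coverage: str) -> float:
    """Hours the report waits before anyone can start responding."""
    delta = next_staffed_moment(reported, coverage) - reported
    return delta.total_seconds() / 3600.0


# A fault reported on Friday 5 January 2024 at 18:00.
friday_evening = datetime(2024, 1, 5, 18, 0)
print(waiting_hours(friday_evening, "5x8"))   # waits until Monday 09:00
print(waiting_hours(friday_evening, "7x24"))  # no wait
```

This models only the wait before the response can begin; the time the engineer then needs to contact the user and plan the visit comes on top of it.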

3. Let's look at the third time period – the initial fault diagnosis time T3.

To speed up repair, after-sales service engineers typically talk to the user by telephone or other means before travelling to the site, in order to understand the fault symptoms and obtain information about the UPS system's fault status. This initial assessment is crucial because it guides the subsequent on-site repair. Its duration depends on several factors: the user's maintenance skills, the operating condition of the system before the fault, the engineer's technical and communication abilities, and the product's intelligent management features and ease of use. For example, the more familiar the user is with the UPS system and the higher the technical level of the user's maintenance personnel, the shorter the initial assessment. Besides the technical capabilities of the user and the engineer, non-technical factors often become decisive for T3. Differences in dialect, language habits, and even personality between the user and the engineer, as well as the engineer's communication skills, all directly affect how well the two communicate and therefore how long T3 lasts.

4. Let's look at the fourth time period – on-site service time T4.

The on-site service time of manufacturer engineers is affected by factors such as spatial distance, weather conditions, and traffic conditions, but it is relatively easy to control and can be treated as a relatively stable parameter when performing MTTR analysis.

5. Finally, let's look at the fifth time period – troubleshooting time T5.

This period depends not only on the technical skill of the after-sales service engineers but also directly on the result of the initial fault diagnosis in the third step. If that diagnosis is wrong, the spare parts brought to the site may not match what the repair needs, which delays the repair. The structural design of the UPS system also has a significant impact on troubleshooting time. For example, some manufacturers use modular designs, which greatly shorten the time needed to replace faulty components. Other manufacturers employ so-called "N+1" modular redundant configurations, which further shorten the troubleshooting time T5.
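As a rough illustration of what an "N+1" configuration means for sizing, the Python sketch below assumes identical power modules and checks whether the load is still carried when one module is out of service. The 20 kVA module rating and the function names are hypothetical; the 80 kVA load matches the case system discussed earlier. The sketch does not model repair time itself.

```python
import math


def modules_required(load_kva: float, module_kva: float) -> int:
    """N: the smallest number of identical modules that can carry the load."""
    if load_kva <= 0 or module_kva <= 0:
        raise ValueError("load and module rating must be positive")
    return math.ceil(load_kva / module_kva)


def survives_single_module_failure(load_kva: float, module_kva: float,
                                   installed: int) -> bool:
    """True if the remaining modules still cover the load after one fails."""
    return (installed - 1) * module_kva >= load_kva


load, module = 80.0, 20.0            # module rating is a hypothetical value
n = modules_required(load, module)   # N modules carry the load
print(n, survives_single_module_failure(load, module, n))      # N only
print(n + 1, survives_single_module_failure(load, module, n + 1))  # N+1
```

With only N modules, a single module failure leaves the load short of capacity; with N+1, the faulty module can be replaced while the others keep supplying the load.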

In summary, the manufacturer's service standards and the engineer's technical level both have a significant impact on fault repair time. Among the individual stages, however, fault alarm notification and initial fault diagnosis are often the main reasons for a prolonged repair time, and hence a longer MTTR, because they are easily affected by many uncertain factors. They are also the stages most often overlooked.

Two things are needed to effectively shorten T1 (fault alarm notification time), T3 (initial fault diagnosis time), and T5 (troubleshooting time). First, the UPS system must have a remote fault alarm function, so that when a fault occurs it can promptly report the fault information to off-site operation and maintenance personnel through effective remote alarm channels. Second, after-sales service engineers must be able to learn about the fault through direct and objective means, so that they obtain accurate and complete fault information and avoid the distortion or omissions introduced by human factors.

To enable UPS systems to possess new functions such as remote alarm, remote testing, remote fault diagnosis, and remote repair, new power management technologies (including a range of accessories and software products) are required. The following section further describes the fault repair process after adopting these power management technologies, demonstrating the profound impact power management technologies are having on the availability of UPS systems.

Fitting the UPS system with a new remote alarm management card gives system administrators a configurable monitor. Once configured, the card automatically checks the UPS at regular intervals according to the administrator's settings. When it detects a potential problem or a fault, it immediately and automatically sends an alarm notification to the operation and maintenance personnel by telephone, pager, email, or SMS. A potential problem can then be dealt with before it becomes a fault, and an actual fault can be reported promptly to the manufacturer's after-sales service department, which shortens the alarm time T1 to the order of minutes.

After receiving the alarm, the UPS maintenance personnel immediately notify the manufacturer's after-sales service personnel. The manufacturer's engineers can then connect to the faulty UPS directly over the telephone network or the Internet, test and diagnose it remotely, and download its operating parameters and operating history. All of this is done by the engineers themselves, without user intervention, which avoids human interference and makes the initial fault diagnosis more accurate. It significantly shortens the initial fault diagnosis time T3 and lays the foundation for shortening the troubleshooting time T5.

Once the fault has been accurately identified, the engineer can handle it accordingly. If it is simply the result of improper system parameter settings, adjusting the relevant UPS parameters remotely is enough to resolve it. If on-site troubleshooting is required, the engineer can bring the right spare parts. Because the initial diagnosis is relatively accurate, the troubleshooting time T5 is correspondingly shorter. The overall Mean Time To Repair (MTTR) is thus significantly reduced, which substantially improves system availability.
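The periodic check performed by such a management card might look roughly like the following Python sketch. Everything in it is hypothetical: the `UpsStatus` fields, the overload threshold, and the `read_status` and `notify` callbacks stand in for whatever a real card reads from the UPS and whichever channel (telephone, pager, email, or SMS) it uses. No actual product interface is implied.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class UpsStatus:
    """Snapshot of a UPS, as a hypothetical management card might read it."""
    on_bypass: bool
    on_battery: bool
    load_percent: float
    hardware_fault: bool


def evaluate(status: UpsStatus, overload_percent: float = 100.0) -> List[str]:
    """Return one alarm message per abnormal condition in the snapshot."""
    alarms: List[str] = []
    if status.hardware_fault:
        alarms.append("hardware fault reported")
    if status.on_bypass:
        alarms.append("UPS is in bypass mode: load is fed directly from the mains")
    if status.on_battery:
        alarms.append("mains lost: UPS is running on battery")
    if status.load_percent > overload_percent:
        alarms.append(f"overload: load at {status.load_percent:.0f}%")
    return alarms


def poll_once(read_status: Callable[[], UpsStatus],
              notify: Callable[[str], None]) -> int:
    """Run one check cycle and push every alarm to the notification channel."""
    alarms = evaluate(read_status())
    for message in alarms:
        notify(message)
    return len(alarms)


# Usage: a UPS locked in bypass mode, with alarms collected in a list.
sent: List[str] = []
poll_once(lambda: UpsStatus(True, False, 60.0, False), sent.append)
print(sent)
```

A check of this kind, run every few minutes, would have flagged the bypass condition in the Tianjin case long before the mains outage reached the load.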
