Fault-tolerant design scheme suitable for core router main control system

Abstract: This paper proposes a fault-tolerant design scheme for the main control system of a core router. Instead of the traditional hot backup method, the scheme combines hardware redundancy with software fault tolerance, and hot standby with duplex operation. The basic problems in designing a fault-tolerant main control system are analyzed, and a concrete implementation is proposed to address them. Test results show that main control software built on this scheme has good fault tolerance and fault recovery capability and can meet the high availability requirements that a core router places on its main control software.

Keywords: main control system; core router; fault tolerance; hot backup; hardware redundancy

1 Preface

With the rapid development of high-speed networks and people's growing reliance on them, the reliability of backbone networks has become particularly important. In addition, with China's increasing investment in network infrastructure and its firm commitment to producing core network equipment domestically, research and development of core network equipment in China has progressed rapidly and has achieved some results.
However, a gap remains compared with equipment designed by well-known foreign manufacturers and research institutions. The gap lies not only in functionality but, more importantly, in reliability, fault tolerance, and scalability, which directly affect the quality of service that the equipment provides. This paper therefore studies the fault-tolerant design of the main control system of the core router, a key piece of core network equipment, and proposes a highly fault-tolerant implementation scheme suited to it.

2 High Reliability Technology

High reliability refers to sustainable, consistent, and complete data access. High availability systems meet their availability requirements by improving the reliability of servers, disks, networks, and applications. Specifically, disk reliability can be improved through shared disk arrays, network reliability through redundant networks, server reliability through cooperating servers, and application reliability through application monitoring and effective recovery.

As a core device in a computer network, a router must be highly available. On the hardware side, this requires a robust architecture with comprehensive redundancy; key components such as the routing engine and the switching matrix must be redundant. On the software side, the router software itself must be robust. Furthermore, when the network is adjusted locally, for example through hardware replacement, system upgrades, added cards, or link changes, the software must ensure that service across the network is not affected and that availability stays very high. It must also ensure that a routing engine switchover causes no packet loss.
If the primary engine fails, the switch to the secondary engine must be smooth and lossless; otherwise the hardware redundancy is meaningless and amounts to pseudo-redundancy. A smooth restart must also be guaranteed. When a router restarts, the resulting route recalculation and network-wide route updates normally consume processing resources and may cause unexpected network behavior such as black holes or transient forwarding loops. A smooth restart avoids these situations.

Research on the availability of network equipment is currently carried out mainly by equipment manufacturers. Because the work is highly specialized and confidential, few design details are available for reference. For maintainable systems, reliability is measured by availability, and the corresponding theory is called availability theory. A core router is a maintainable system. Ranked by reliability from high to low, systems fall into four levels: continuous availability systems, fault tolerance systems, high availability systems, and disaster tolerance systems. The first two are generally used in aerospace and military fields, while core routers are required to achieve high availability.

System availability is the probability that a system operates successfully according to its specification, given an acceptable limit on failures. Availability theory covers two main areas: fault-avoidance techniques, which improve the reliability of components, and fault-tolerant techniques, which construct highly reliable systems from given components. Component reliability research is now quite mature and widely applied in industry. However, no matter how many fault-avoidance methods are applied, no system can be guaranteed never to fail. Fault-tolerant techniques have therefore become a research hotspot for improving system availability.
Currently, this technology is widely used to achieve reliability in core routers.

3 Core Router Main Control System Fault-Tolerant System Design

3.1 Basic Issues in Main Control System Fault-Tolerant System Design

To meet the fault tolerance requirements, the router must continue to function normally when the main control system suffers a hardware or software failure. The hardware therefore adopts a 1+1 redundancy design with two main control boards, a primary (Active) board and a standby (Standby) board, which form a dual main control hot standby fault-tolerant system. When the primary main control board suffers a serious failure or is reset, an automatic failover mechanism is triggered: the system performs a primary/standby switchover in a timely manner, and the standby main control board takes over the work of the primary board so that services continue to run normally. This 1+1 redundancy design can be extended to an N+1 redundancy design.

The entire switching process must be transparent to users. The key considerations and implementation difficulties are database consistency between the primary and standby systems, smooth switching, and the fault monitoring mechanism.

The router's main control board holds records of the system's real-time operation. This data must therefore be backed up in real time during normal operation to keep the primary and standby databases consistent; otherwise the standby cannot properly take over from the primary during failover. To address this issue, the high availability module uses a partial hot standby design that combines duplex operation with hot standby. The data to be backed up consists mainly of the routing and forwarding table entries in the system database.
In this partial design, which combines duplex operation and hot standby, both main control boards run heartbeat detection programs for fault detection. The primary main control board runs all applications needed for normal router operation, while the backup main control board runs only some critical applications. These applications run normally and receive the same input data as their counterparts on the primary board, but they do not output results. This design keeps failover latency low, reduces the amount of data that must be backed up, avoids the resource waste of full duplex operation, and avoids many shortcomings of hot standby, so it performs significantly better than pure hot standby or pure duplex methods.

Data backup falls into two types: cold backup and hot backup. Cold backup makes a complete copy of the database after the database has been shut down normally. It is the fastest and safest method, but its biggest drawback is that the database must be closed; a file system backup taken while the database is open is not valid. Hot backup, by contrast, uses archivelog mode to back up data while the database is running. Two solutions are available: dual-machine mirroring and shared disk arrays. Dual-machine mirroring copies tables, files, databases, or the entire content of the primary database server to a backup server over a dedicated connection. It is simple and inexpensive, but it consumes system resources. With a shared disk array, two hosts share a single disk array. This approach does not degrade system performance and is currently a popular mainstream technology, but it demands high reliability from the disk array.

For routers running in backbone networks, cold backup is clearly unsuitable: the database cannot be shut down periodically for backup while the router is running, and data certainly cannot be backed up after the router has failed.
Therefore, hot backup is used instead. Because the amount of data to be backed up is small, neither dual-machine mirroring nor a shared disk array is necessary. This design uses a novel hot backup method: the data to be backed up is stored as log files, and these files are converted into data streams and sent over TCP from the primary to the backup machine in real time. Connection-oriented TCP transmission is reliable, and the probability of losing backup data in transit is extremely low.

Smooth switching and switchover latency: When the router's main control system fails, the primary/backup switchover must be transparent to users. This requires seamless switching, which reduces latency and data loss during the switchover. Seamless switching is the ideal case and combines fast switching with smooth switching: fast switching means low latency, and smooth switching means a low packet loss rate. For fast switching, the switchover must complete before the single boards drop their links, so that the backup takes over from the primary, the router's processes are unaffected by the main control failure, and the network continues to operate normally. Smooth switching has two requirements: first, the databases on the primary and backup main control boards must be consistent at the time of switching; second, given that consistency, the backup must finish importing the backup data within the specified time [4] after it is enabled. The specified time is included in the total switching time:

Switching time [4] = fault discovery time + switchover enabling time + fault takeover time
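As a rough illustration of the switching-time relation above, the following Python sketch adds the three components and checks the total against the point at which the single boards would drop their links. The class name, field names, and all millisecond values are assumptions made for illustration; the paper gives no concrete timings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchoverTiming:
    """Components of the total switching time, in milliseconds (illustrative values only)."""
    fault_discovery_ms: float  # time until the backup notices the fault
    enable_switch_ms: float    # time to start the switchover
    takeover_ms: float         # time to import backup data and take over

    def total_ms(self) -> float:
        # Switching time = fault discovery + enabling switching + fault takeover
        return self.fault_discovery_ms + self.enable_switch_ms + self.takeover_ms

    def is_fast(self, link_drop_deadline_ms: float) -> bool:
        # Fast switching: finish before the single boards drop the link.
        return self.total_ms() < link_drop_deadline_ms


# Hypothetical numbers, chosen only to exercise the formula.
timing = SwitchoverTiming(fault_discovery_ms=300.0, enable_switch_ms=50.0, takeover_ms=400.0)
print(timing.total_ms())       # 750.0
print(timing.is_fast(1000.0))  # True
```

Keeping the three terms separate makes it easy to see which one dominates: fault discovery is governed by the heartbeat timer, while takeover is governed by how quickly the backup can import the backed-up data.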
Fault monitoring mechanism: The two main control boards in the system determine their master and backup status through master-slave negotiation. One is in the Master state and controls the entire system; the other is in the Slave state and acts as the backup. The two boards exchange their status data in UDP heartbeat messages in order to detect software and hardware faults of the main control. During normal router operation, the master and backup main control boards periodically send keepalive messages to each other for heartbeat detection; each message carries the sender's status information. If the standby does not receive a keepalive message from the primary before its timer expires, it considers the primary to have failed, performs a primary/standby switchover, and becomes the new primary, automatically taking over the original primary's service programs and continuing to provide service. After the original primary recovers or is replaced, it sends a negotiation message to the new primary and becomes the new standby. No second switchover is needed, which saves system resources.

3.2 Design and Implementation of the High Availability Module

In the fault-tolerant design of the main control software, two main control boards are connected to eight single-board units. The two main control boards exchange heartbeat data over connectionless UDP and transmit the backup file data streams over connection-oriented TCP; the main control boards and the single-board units are connected via high-speed Ethernet. Figure 1 shows the overall structure of the system.
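The TCP backup channel can be pictured with a small sketch. TCP delivers a byte stream without message boundaries, so the receiver has to reassemble log records from whatever chunks arrive. The record layout, opcodes, and function names below are hypothetical and are not taken from the paper's implementation; the sketch only shows one way log records could be framed on the primary and replayed on the backup.

```python
import struct
from typing import Dict, List, Tuple

OP_ADD, OP_DEL = 1, 2          # hypothetical opcodes for route changes
_LEN = struct.Struct("!I")     # 4-byte big-endian frame length
_HEAD = struct.Struct("!BHH")  # opcode, prefix length, next-hop length

Record = Tuple[int, str, str]


def encode_record(op: int, prefix: str, next_hop: str = "") -> bytes:
    """Primary side: serialize one log record as a length-prefixed frame."""
    p, n = prefix.encode(), next_hop.encode()
    payload = _HEAD.pack(op, len(p), len(n)) + p + n
    return _LEN.pack(len(payload)) + payload


class StreamDecoder:
    """Backup side: rebuild records from arbitrary chunks of the byte stream."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, chunk: bytes) -> List[Record]:
        self._buf.extend(chunk)
        records: List[Record] = []
        while len(self._buf) >= _LEN.size:
            (length,) = _LEN.unpack_from(self._buf, 0)
            end = _LEN.size + length
            if len(self._buf) < end:
                break  # frame incomplete, wait for more bytes
            payload = bytes(self._buf[_LEN.size:end])
            del self._buf[:end]
            op, plen, nlen = _HEAD.unpack_from(payload, 0)
            body = payload[_HEAD.size:]
            records.append((op, body[:plen].decode(), body[plen:plen + nlen].decode()))
        return records


def apply_record(table: Dict[str, str], record: Record) -> None:
    """Replay one record against the backup's copy of the routing table."""
    op, prefix, next_hop = record
    if op == OP_ADD:
        table[prefix] = next_hop
    elif op == OP_DEL:
        table.pop(prefix, None)


# Simulate the stream arriving in 7-byte chunks instead of whole records.
stream = (encode_record(OP_ADD, "10.0.0.0/8", "192.0.2.1")
          + encode_record(OP_ADD, "10.1.0.0/16", "192.0.2.2")
          + encode_record(OP_DEL, "10.0.0.0/8"))
backup_table: Dict[str, str] = {}
decoder = StreamDecoder()
for i in range(0, len(stream), 7):
    for rec in decoder.feed(stream[i:i + 7]):
        apply_record(backup_table, rec)
print(backup_table)  # {'10.1.0.0/16': '192.0.2.2'}
```

In a real deployment the chunks would come from `socket.recv` on the TCP connection between the two main control boards rather than from slicing a local buffer.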
The high availability module is divided by function into three sub-modules: the AS communication module, the AS system monitoring module, and the AS Keepalive module, as shown in Figure 2.

The AS communication module is responsible for communication, data backup, and TCP transmission among the high availability module, the system data maintenance module (SYSDATA), and the inter-board communication module (BDCOM) on the main control system.

The AS system monitoring module provides core functions such as monitoring, maintaining, and managing the system processes of the main control software. When a process consumes an excessive share of the CPU, the main control software is considered to be malfunctioning. A recovery strategy is then chosen according to the process's operating rules and importance: either the process is restarted or a primary/standby switchover is initiated.

The AS Keepalive module handles master-slave negotiation between the two main control boards and determines their master/slave status. During normal router operation, it periodically sends keepalive messages to the other main control board for heartbeat detection. Keepalive messages may be lost because of network congestion, or processed late while queuing for the CPU under multi-threaded processing, and either case can make the primary main control board appear to have failed when it has not. To handle this, a re-negotiation technique is used. If no keepalive message arrives from the other main control board before the timeout, that board is not assumed to be faulty; instead, a backoff is started and a negotiation message is sent to re-negotiate. This re-negotiation differs from the master-slave negotiation performed at initialization. Compared with the commonly used simple, fixed heartbeat detection, re-negotiation adapts better to different heartbeat conditions and is more stable, and thus better ensures high system availability.
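The re-negotiation idea can be sketched as a small state machine on the backup board. This is a minimal sketch under stated assumptions, not the paper's implementation: the class and parameter names are hypothetical, the doubling backoff and the retry limit are choices made here (the paper only says that a backoff is started and a negotiation message is sent), and the `sent` list stands in for actually sending negotiation messages over UDP.

```python
from enum import Enum


class Role(Enum):
    MASTER = "master"
    SLAVE = "slave"


class KeepaliveMonitor:
    """Backup-side peer monitor with re-negotiation (all parameters are illustrative)."""

    def __init__(self, timeout: float = 1.0, backoff: float = 0.5, max_retries: int = 3) -> None:
        self.role = Role.SLAVE
        self.timeout = timeout          # silence allowed before re-negotiating
        self.backoff = backoff          # first wait after a negotiation message
        self.max_retries = max_retries  # unanswered negotiations before takeover
        self.retries = 0
        self.deadline = timeout         # absolute time of the next check (clock starts at 0)
        self.sent = []                  # times at which negotiation messages were "sent"

    def on_peer_message(self, now: float) -> None:
        # Any keepalive or negotiation reply proves the peer is alive.
        self.retries = 0
        self.deadline = now + self.timeout

    def tick(self, now: float) -> Role:
        if self.role is Role.SLAVE and now >= self.deadline:
            if self.retries < self.max_retries:
                # Do not assume a fault yet: back off and re-negotiate.
                self.sent.append(now)
                self.deadline = now + self.backoff * (2 ** self.retries)
                self.retries += 1
            else:
                # Peer stayed silent through every retry: take over as master.
                self.role = Role.MASTER
        return self.role


# Case 1: a keepalive is merely delayed; the reply cancels the re-negotiation.
delayed = KeepaliveMonitor()
delayed.tick(1.0)             # timeout reached, negotiation message sent
delayed.on_peer_message(1.2)  # peer answers, so no switchover happens
print(delayed.tick(2.0))      # Role.SLAVE

# Case 2: the peer is really down; takeover follows the last unanswered retry.
dead = KeepaliveMonitor()
for t in (1.0, 1.5, 2.5, 4.5):
    role = dead.tick(t)
print(role, dead.sent)        # Role.MASTER [1.0, 1.5, 2.5]
```

The cost of this design is a longer fault discovery time when the peer has really failed, which counts toward the total switching time; the retry limit and backoff would need tuning against that budget.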
[align=center]Figure 1 Overall Structure Diagram of the Fault-Tolerant System for Main Control Software[/align]

[align=center]Figure 2 Detailed Design Diagram of the High Availability Module[/align]

4 System Fault Tolerance Testing

This paper uses an Adtech AX/4000 router tester to test the efficiency and reliability of HAL under different loads. The packet transmission rate during the test follows a Markov Modulated Poisson Process (MMPP). The test results are shown in Figure 3: Figures 3-1 and 3-2 show the router's latency and throughput, respectively, under different fault conditions. The results show that the fault-tolerant design of this main control system can handle the various errors that occur during router operation, although latency and throughput are affected to some extent. In particular, under a 10% fault condition the system latency is not particularly large, which may reduce the impact of failures on users. Under high fault conditions, however, system throughput drops very significantly. Future work will therefore focus on finding the cause of this drop and reducing it.

[align=center]Figure 3-1 Latency Test Figure 3-2 Throughput Test[/align]

5 Summary

This paper studies the main control system architecture of a T-bit core router and designs a high availability module for it. The module adopts a hot backup mode and eliminates single points of failure in the T-bit router's main control system by combining hardware redundancy of the main control board with software mechanisms such as hot data backup and heartbeat detection. The module is applied in the T-bit router's main control software system. When the primary main control board fails, it performs the primary/backup switchover quickly, accurately, and smoothly, which improves system stability and reliability and ultimately gives the router high availability.

References:

1. Department of Science and Technology, Ministry of Information Industry of the People's Republic of China. YD/T1097-2001, "Technical Specification for Router Equipment - High-End Router". 2001.
2. Cisco White Paper. The Evolution of High-End Router Architectures: Basic Scalability and Performance Considerations for Evaluating Large-Scale Router Designs.
3. Vitesse Semiconductor Corporation, Longmont, Colorado. IQ2000 Network Processor Product Brief. 2000.
4. James Aweya. On the design of IP routers Part 1: Router architectures. Journal of Systems Architecture 46 (2000), pp. 483-511.
5. Yan Yonghong, Zhang Fan. Hardware optimization of TCAM route update [J]. Microcomputer Information, 2006, 12-2: 254-256.
