Hardware optimization for TCAM router updates

Abstract: Modern core routers have increasingly higher requirements for lookup rate, table update speed, and lookup table capacity. At present, most industrial manufacturers adopt TCAM (Tri-state Content Association Memory) based solutions. The biggest feature of TCAM is its fast lookup speed, but its update algorithm wastes a lot of storage space. To address this problem, this paper proposes a routing update method that uses FPGA to provide hardware support. When adding a new table entry, only one preprocessing is needed for the new table entry. The forwarding table does not need to be sorted by prefix length, eliminating the storage space waste caused by reserving free table entries. Keywords: Router; Tri-state Content Association Memory; Prefix; Forwarding Table 1 Introduction Figure 1 PLO-OPT algorithm illustration Fast routing lookup has always been a hot research project. In recent years, people have proposed many software-based[1] or hardware[2-5] implementation methods. With the continuous improvement of router interface speed, it has become increasingly difficult to implement high-speed routing lookup using software methods. At present, the most commonly used hardware routing lookup method in the industry is TCAM. The advantage of TCAM is that the entries it retains are very flexible in terms of length requirements. It can store keyword entries of any length in the same TCAM chip [6]. Therefore, TCAM is very suitable for finding the longest prefix route. However, since TCAM simply outputs the storage address of the lowest matching entry as the result (index), in order to ensure the longest prefix match, the entries must be stored in descending order of the prefix length relative to the address. That is, the lower address stores entries with longer prefixes and the higher address stores entries with shorter prefixes. This storage order relationship makes the TCAM update operation complicated. When a new prefix entry is added, some prefix entries need to be moved in order to maintain the order relationship between entries. The existing PLO-OPT algorithm (prefix length ordering constraint algorithm) [5] for IPv4 is currently the best algorithm for processing TCAM entry updates. This algorithm stores entries of different prefix lengths from long to short in sequence from low address to high address, and concentrates the free TCAM entries in the middle position of the address area where the prefix entries are stored, forming a free pool, as shown in Figure 1. When a new entry is added, all entries above or below the added position are moved one address to the middle free pool to create a free space. Assuming the TCAM stores N routing information entries, the free space size must be maintained at N/2 entries, which is obviously too costly on a backbone router with hundreds of thousands of entries. 2. Update Entry Preprocessing To improve TCAM storage space utilization, we want irrelevant entries to avoid being moved during update, thus eliminating the need to construct a free pool. If low-address entries can ensure they do not contain high-address entries (i.e., the prefix of a low-address entry is not a prefix of the prefix of a high-address entry), then searching the TCAM will always return the entry with the longest prefix match. To ensure that low-address entries do not contain high-address entries, this paper proposes the following algorithm: 1. Construct a RAM to store prefix lengths, with the same depth as the TCAM and a width of 5 bits (suitable for IPv4). When an entry is inserted, its prefix length is stored in RAM at the same location as its position in the TCAM for easy lookup. 2. Perform a read operation on TCAM using the address portion of the update entry (update entries appear in the order of <address, mask>) as the key. 3. If a match exists, read the prefix length stored in RAM based on the TCAM output, denoted as 'a'. If no match exists, write the update entry to the address following the last entry in TCAM. 4. If a match exists, and the prefix length of the update entry is denoted as 'b' (b>=a is always true), then expand the matching entry into 2b-a sub-entries. Example: If TCAM contains entry 01011* (5-bit prefix), inserting entry 0101110* (7-bit prefix) results in a matching entry, which is then expanded into four sub-entries: 0101100*, 0101101*, 0101110*, and 0101111*. 5. Use the generated sub-item as the key to perform a lookup operation in TCAM. If a sub-item finds a match, the sub-item overwrites the storage location of the match. If no match is found, it is stored in the last entry in TCAM. During the insertion of sub-items into the list, multiple sub-items may match the same entry, but only one sub-item needs to overwrite the match. No further expansion of the match is needed because the matching result after querying with the sub-item as the key will always appear at the lower address of the first match. The sub-item generated after further expansion will definitely be included in the result of the previous expansion. This preprocessing ensures that the updated entry will not be included in any entry at the lower address. TCAM routing tables constructed in this way do not need to free up entry space for updates, thus avoiding space loss caused by constructing a free pool. Figure 2 shows the schematic diagram of the prefix preprocessing circuit. 3. Prefix Preprocessing Circuit Principle Figure 2 is a schematic diagram of the preprocessing circuit. The updated entry is separated into an address part and a mask part in the input interface. First, the address part is ANDed with the inverted mask, and then this is used as the key and sent to TCAM. If no match is found, the update entry is directly written into the TCAM, and the prefix length b obtained after mask decoding is stored in the same location in RAM. If a match is found, the prefix length a of the corresponding entry is read from RAM based on the matching result, a is encoded into a mask, inverted, and then ANDed with the address part in the update entry. This value is the first sub-entry generated. Using this value as the base, 2b-a-1 increments are performed to obtain all sub-entries. Finally, all sub-entries are used as keys to search for matches in the TCAM. If a match is found for a sub-entry, it is overwritten. If no match is found for a sub-entry, it is written to the address pointed to by the address register, which is the address after the last entry in the TCAM. Then, this address is incremented by 1 and written back to the address register. (Datain is the input port for TCAM key lookup, wrx is the write port for TCAM update, address is the input address port for TCAM update, matchaddress and matchfound are TCAM output ports; other TCAM ports are not listed.) Figure 3 shows a schematic diagram of the state transition of the control state machine. The control state machine determines the operating state of the entire circuit by controlling the data path, and its state transition is shown in Figure 3. Initially in the idle state; when an update entry arrives, the state changes from idle to lookup1. In this state, the address portion of the update entry is released to the TCAM's datain port for matching. If matchfound=0, indicating no match, the state changes to renew1, releasing the update entry to the TCAM's wrx port and the address in the address register (the address of the last routing entry written to the TCAM during initialization) to the TCAM's address port. Then, the state changes to writeram1, setting the RAM to write mode, releasing the address in the address register to the RAM's addr port, and writing the prefix length to the RAM. If matchfound=1, indicating a match, the state changes to readram, setting the RAM to read mode, releasing matchaddress to the RAM's addr port, and reading the prefix length 'a' of the match entry. Then, the state changes to judge, checking if 2b-an is 0 and incrementing n by 1. If 2b-an is not 0, the state changes to lookup2; otherwise, the state changes back to idle. In the `ookup2` state, the generated sub-item is released to the TCAM's `data` port for matching. If `matchfound` = 1, it enters the `renew2` state; otherwise, it enters the `renew3` state. In the `renew2` state, the control state machine releases the generated sub-item to the TCAM's `wrx` port and releases `matchaddress` to the TCAM's `address` port (the generated sub-item will overwrite the matching item). After the `renew2` state, it enters the `writeram2` state, sets `ram` to write mode, and releases `matchaddress` to the `ram`'s `address` port (the prefix length of this address in `ram` will be rewritten). After the `writeram2` state, it enters the `judge` state to continue judging. If it enters the `remew3` state, the state machine releases the sub-item to the TCAM's `wrx` port and releases the address in the address register to the TCAM's `address` port. Then it enters the `writeram3` state, sets `ram` to write mode, releases the address in the address register to the `ram`'s `addr` port, and increments the number in the address register by 1. Finally, it enters the `judge` state to continue judging. 4. Implementation of Prefix Preprocessing Circuit using FPGA Figure 4. Simulation waveforms of four sub-entry expansions. With the rapid development of the Internet, router port speeds have increased from OC3 (155.52Mbps), OC12 (622.08Mbps), OC48 (2.448Gbps) to OC192 (10Gbps), and even OC768 (40Gbps). The processing speed for routing updates within routers must also increase. The prefix preprocessing circuit proposed in this paper, as an auxiliary device in the router, also needs to meet certain data processing speed requirements. Due to its high speed, small area, reliable performance, low cost, and strong portability, FPGA has become the preferred solution for high-speed digital circuit design. This paper uses FPGA to implement the design of the prefix preprocessing circuit. The hardware description language used in the design is Verilog HDL, compiled and simulated under ModelSimXE5.7, and the synthesis tool is Synplify pro7.1, a logic synthesis tool from Synplicity specifically for FPGA/CPLD. The selected device is a Xilinx Virtex2 series, model XC2V500, speed -6, package FG256. After the circuit description language is compiled, a test platform is established through the top-level file. The platform includes a preprocessing circuit module, a signal source module, and a TCAM module. The preprocessing circuit module is synthesizable, while the signal source module and TCAM module are behavioral-level virtual modules and cannot be synthesized into a gate-level netlist. During simulation, the signal source module inputs the routing table entry (address/mask sequence pair) to be updated to the preprocessing circuit. After processing by the preprocessing circuit, a new update entry (original prefix or original prefix extended sub-entry) is generated and output to the TCAM module. The routing prefix in the TCAM module is stored in memory. Before simulation, the memory itself has no routing information. During simulation, a routing table needs to be read from the local disk first using the $readmemb system command. Figure 4 shows the simulation waveform of the table entry expansion. As can be seen from the figure, the input update entry is expanded into 4 sub-entries, of which 2 sub-entries also have matching entries found in the TCAM. After functional simulation and post-synthesis simulation passed, Xilinx's ISE 5.2 placement and routing tool was used to place and route the prefix preprocessing circuit. The constraint frequency used during the process was 80MHz. The ISE static timing analysis report shows that all paths meet the 80MHz constraint. After placement and routing, the standard delay file was back-annotated onto the simulation model in ModelSim. Simulations under various stimuli showed that the preprocessing circuit fully meets the design requirements. Conclusion The authors' innovation lies in preprocessing updated routing table entries to ensure that updated entries are not included in lower-address entries. This eliminates the need to arrange all entries by prefix length, thus avoiding the practice of maintaining a free pool of depth N/2 in the POL-OPT algorithm and saving valuable TCAM storage space. The entire circuit scheme is reasonable and feasible, with a maximum operating frequency of 80MHz. References [1] Miguel A, Ruiz-sanchez Survey and taxonomy of IP address lookup algorithms [J], IEEE Network, 2001, (2) 8-23 [2] Pankaj Gupta, Steven Lin, and Nick McKeown. Routing lookups in hardware at memory access speeds [A]. IEEE NFOCOM [C]. San Francisco: Prentice Hall, 1998 1240-1247. [3]Mcauley AJ, Francis P. Fast routing table lookup using CAMs[A]. Proc IEEE NFOCOM[C] San Franciso.Prentice Hall, 1993.1382-1391. [4] Kobayashi M, Murase T, Kuriyama A. A longest prefix match search engine for multi-gigabit IP processing [A]. Proceedings of ICC 2000[C]. New York: Cannes Hall, 2000 834-842 [5]Devavrat Shah, Pankaj Gupta Fast updating algorithms for TCAMs [J]. IEEE Micro, 2001, (1):36-47. [6]Tu Zhen, Liang Jinshan, Yang Kuiwu. Application of TCAM in high-speed route lookup and its FPGA implementation [J]. Microcomputer Information, 2005, 21(4):208-209.

Hardware optimization for TCAM router updates

Read next

CATDOLL 128CM Ava (TPE Body with Soft Silicone Head)

CATDOLL 115CM Milana TPE

CATDOLL 136CM Mila

CATDOLL 132CM Wendy(TPE Body with Hard Silicone Head)