Research and Implementation of VIM-based Embedded Storage Controller
1 Introduction

With the rapid development of VLSI technology, microprocessor clock speeds keep rising and performance is growing rapidly. Although memory integration is also increasing and access latency is decreasing, processor performance grows at roughly 50%–60% per year, while memory performance improves by only 5%–7% annually. The low bandwidth and high latency of DRAM prevent high-performance processors from being fully utilized, and the speed gap between processor and memory is increasingly the bottleneck limiting overall system performance. Many researchers have attacked the problem at the microarchitecture level, adopting techniques such as out-of-order execution, multithreading, prefetching, branch prediction, and speculative execution, or hierarchical storage structures with multi-level caches, to compensate for the performance gap between microprocessor and memory. However, these techniques bring problems of their own: complexity, large area, low resource utilization, high cost, and exhaustion of memory bandwidth; they do not truly solve the memory bottleneck. Even with the emergence of new memory products such as DDR and Rambus memory, memory frequencies have increased somewhat, but the performance gap between processor and memory continues to widen.

How can memory performance bottlenecks truly be eliminated? PIM (Processing in Memory) technology tightly integrates processor and memory on a single chip. Advances in semiconductor manufacturing allow CMOS logic cells to be integrated with SRAM or DRAM on one silicon die. This overcomes the limitation of inter-chip pins, exploits the hidden bandwidth of memory, and reduces access latency (converting inter-chip access latency into intra-chip access latency).
Based on PIM technology, the Vector In Memory (VIM) architecture, which uses a vector unit as a coprocessor, can fully exploit the high bandwidth, low latency, and low power of PIM to develop data-level parallelism, making it an effective way to remove memory-system performance bottlenecks. This paper describes the design and implementation of the embedded memory controller, a key component affecting memory-system performance in the VIM architecture.

2 VIM Architecture

VIM is a vector architecture oriented towards streaming data processing. The processor part of its microarchitecture consists of one scalar core and one vector coprocessor. The embedded memory controller and the memory form an on-chip DRAM memory system, interconnected by a high-speed memory crossbar switch. The most important feature of the VIM architecture is the combination of the vector processor and the embedded memory. Figure 1 shows a typical framework of the VIM system.

2.1 RISC Scalar Core

The scalar core is a synthesizable, highly integrated 32-bit RISC processing core with an instruction set compatible with SPARC V8. It includes an integer unit and a floating-point unit and supports both user mode and supervisor mode. Its main functions are executing scalar instructions (the SPARC instruction set), handling exceptions and interrupts, delivering vector instructions to the vector unit, transferring data between the scalar core and the vector unit, and communicating with the host to complete instruction fetch and decode.

2.2 Vector Unit

The vector unit acts as a coprocessor for the scalar core, executing extended vector instructions in parallel with it. It consists of a vector instruction queue, a vector controller, a vector core register file, and vector lanes. The Vector Instruction Queue (VIQ) is an asynchronous FIFO queue.
The scalar core partially decodes vector instructions and writes them into the VIQ. The vector fetch unit then fetches instructions from the VIQ in order and passes them to the decoder for further decoding. The vector controller ensures correct execution of vector instructions and performs vector pipeline control. The vector core register file consists of a vector register file, a scalar register file, a control register file, and a flag register file, with 32 32-bit registers. The vector lanes are built from parallel vector pipelines, including fully pipelined LSU, ALU, FPU, and other components. Each lane can be viewed as a datapath processing data of a specific width; multiple lanes execute in parallel.

2.3 Embedded Memory System and I/O

The embedded memory system consists of a memory interconnect crossbar switch, an embedded memory controller, and memory. The crossbar switch performs address translation, memory-access instruction routing, and data return for the on-chip functional units' memory accesses. It interconnects the scalar memory-access unit, the vector lane memory-access units, and the embedded memory, and features multiple ports, high bandwidth, and low latency. The embedded memory controller receives memory-access instructions from the crossbar switch to access DRAM data. The memory is a multi-bank interleaved memory composed of four independent banks. Each bank has its own independent access-control interface and is further divided into four sub-banks. Multi-bank memory allows accesses mapped to different banks to execute simultaneously, enabling parallel operations on multiple vector elements. Each bank has its own memory controller, yielding lower access latency and lower power than centralized memory control. I/O interfaces interconnect multiple VIM nodes and connect I/O devices.
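As an illustration of how multi-bank interleaving enables parallel access to vector elements, the following behavioral sketch (assuming a simple low-order interleaving, not the actual hardware mapping) shows which banks a strided vector access touches:

```python
# Behavioral sketch (hypothetical mapping): with four independent banks and
# low-order interleaving, consecutive word addresses spread across all banks,
# so a unit-stride vector access can proceed in parallel, while a stride equal
# to the bank count serializes on a single bank.

NUM_BANKS = 4  # four independent memory banks, as in the VIM memory system

def bank_of(word_addr: int) -> int:
    """Low-order interleaving: bank index comes from the low address bits."""
    return word_addr % NUM_BANKS

def banks_touched(base: int, stride: int, vlen: int) -> set:
    """Set of banks touched by a vector access of vlen elements."""
    return {bank_of(base + i * stride) for i in range(vlen)}

# Unit stride spreads across all four banks -> full parallelism.
print(sorted(banks_touched(0, 1, 8)))   # [0, 1, 2, 3]
# Stride 4 maps every element to the same bank -> accesses serialize.
print(sorted(banks_touched(0, 4, 8)))   # [0]
```

A real mapping may hash or swizzle the address bits, but the bank-conflict behavior it must avoid is the same.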
The embedded memory controller implements initialization, activation, row selection, column selection, auto precharge, periodic refresh, read latency, and write recovery for the DRAM banks. It supports four burst read/write modes and two row-selection latency options, with strict timing requirements. The design scheme, design concept, and implementation of the embedded memory controller are described in detail below.

3 Embedded Memory Controller Design and Implementation

3.1 Embedded Memory Controller Module Structure

The memory controller is the control interface between the system bus and the DRAM. In the VIM memory system, the embedded memory controller converts the read/write commands from the memory crossbar switch into DRAM control signals to drive DRAM read/write and refresh operations; decomposes the addresses from the crossbar switch into bank addresses, row addresses, and column addresses, placing them on the DRAM address lines at the required times; and controls data movement between the data bus and the DRAM data lines. The VIM-1 embedded memory controller has been implemented; its module structure is shown in Figure 2. The controller consists of a main control module, a signal generation module, a refresh module, and a data path. The functions of each module are described in detail below.

3.1.1 Main Control Module

The main control module controls the various DRAM functions and consists of an initialization state machine submodule, a command state machine submodule, and a timer submodule. Under the timing control of the timer submodule, the initialization state machine submodule generates the states required during DRAM initialization.
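The address decomposition the controller performs can be sketched as simple bit slicing. The field widths and ordering below are assumptions for illustration only; the actual widths depend on the DRAM organization used in VIM-1:

```python
# Address decomposition sketch. Field widths are hypothetical (2-bit bank,
# 12-bit row, 8-bit column) and the layout {row | bank | column} is an
# assumption; the real controller's widths follow its DRAM organization.

BANK_BITS, ROW_BITS, COL_BITS = 2, 12, 8

def decompose(addr: int):
    """Split a flat address into (bank, row, column) bit fields,
    with the column in the low bits, then the bank, then the row."""
    col  = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row  = (addr >> (COL_BITS + BANK_BITS)) & ((1 << ROW_BITS) - 1)
    return bank, row, col
```

Placing the bank field in low-order bits (as here) interleaves consecutive cache-line-sized accesses across banks, which suits the multi-bank parallelism described in Section 2.3.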
Through state transitions it controls the initialization of the DRAM module and transmits its state signal iState to the command state machine submodule, signal generation module, timer submodule, and data path. Under the timing control of the timer submodule, the command state machine submodule generates the states required during DRAM read/write and refresh cycles. It controls read/write accesses and refresh operations on the DRAM module and transmits its state signal cState to the data path module, timer submodule, and signal generation module. The timer submodule times the DRAM module's internal operations according to the DRAM module's timing specification, mainly by controlling the state-transition timing of the initialization state machine and the command state machine.

3.1.2 Refresh Module

The refresh module generates refresh requests for the DRAM module: driven by an internal counter, it sends a refresh request to the command state machine every fixed number of clock cycles (the exact number depends on the DRAM module's parameters) until the command state machine responds with a refresh acknowledgment.

3.1.3 Signal Generation Module

The signal generation module converts the iState state from the initialization state machine and the cState state from the command state machine into the DRAM's command signals, mainly sdr_CKE (clock enable), sdr_CSn (chip select), sdr_RASn (row address strobe), sdr_CASn (column address strobe), and sdr_WEn (write enable). At the specified times it also converts the address signals from the address bus into the corresponding DRAM bank address and row/column address.
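The state-to-pin conversion in the signal generation module amounts to a small truth table. The sketch below uses the standard SDR SDRAM command encodings for the {CSn, RASn, CASn, WEn} pins; the state names are illustrative, not the module's actual identifiers:

```python
# Command-encoding sketch for the signal generation module. Pin levels follow
# the standard SDR SDRAM command truth table; this table is what drives
# sdr_CSn / sdr_RASn / sdr_CASn / sdr_WEn. State names are hypothetical.

COMMANDS = {
    #                 CSn RASn CASn WEn
    "NOP":           (0,  1,   1,   1),
    "ACTIVE":        (0,  0,   1,   1),   # open a row (row address on bus)
    "READ":          (0,  1,   0,   1),   # column address on bus
    "WRITE":         (0,  1,   0,   0),
    "PRECHARGE":     (0,  0,   1,   0),
    "AUTO_REFRESH":  (0,  0,   0,   1),
    "LOAD_MODE":     (0,  0,   0,   0),   # mode-register data on address bus
}

def drive_pins(state: str):
    """Return (sdr_CSn, sdr_RASn, sdr_CASn, sdr_WEn) for a controller state;
    unknown states fall back to NOP, a safe default in hardware as well."""
    return COMMANDS.get(state, COMMANDS["NOP"])
```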
3.1.4 Data Path

Under the timing control of the timer submodule, the data path module writes data from the data bus into the DRAM according to the command state signals in the corresponding states, and outputs data from the DRAM data line sdr_DQ to the system bus, asserting the data-valid signal sys_D_Valid (setting it to 1) while data is being output.

3.2 DRAM Initialization

Before the DRAM can perform normal memory accesses, it must be initialized. The initialization state machine submodule in the main control module implements DRAM initialization; Figure 3 shows its state transition diagram. The initialization process is as follows. On system reset, the DRAM is in a no-operation state and the initialization state machine is in the i_NOP state. Once reset is complete and the power supply and clock are stable, the DRAM initialization sequence begins. After one precharge, two auto-refreshes, and a load-mode operation, the machine enters the ready state and initialization is complete. The load-mode state issues the DRAM's internal load-mode command, loading the data on the address bus into the DRAM mode register and configuring the working mode required by the user. The mode register defines the burst length, burst type, CAS latency, and so on. Whenever the DRAM module is in a no-operation state, the mode register can be reloaded with different values, changing the DRAM's working mode. Furthermore, since the DRAM delay periods vary with the actual DRAM speed grade, the number of clock cycles waited in each delay state depends on the clock period tCK.
When the clock period tCK is greater than the delay time (here, the precharge period, refresh period, and load-mode delay), no waiting is actually needed; the direct transitions from the precharge, refresh, and load-mode states to the final ready state are shown by the dashed lines in Figure 3.

3.3 Read/Write Cycle

Figure 4 shows the state transition diagram of the VIM-1 embedded memory controller's command state machine; its state transitions control the DRAM's read, write, and refresh operations. During system reset the DRAM is idle and the command state machine is in the idle state. After DRAM initialization, the command state machine monitors the bus address strobe signal sys_ADSn (active low, indicating a bus request) and the refresh request signal sys_REF_REQ. If there is a refresh request, the command state machine puts the DRAM into a refresh cycle; otherwise, if sys_ADSn is active, the command state machine transitions to the active state and the DRAM enters a read/write cycle. The command state machine transitions unconditionally from the active state to the active-delay state. In the active-delay state it samples the read/write command signal on the system control bus: if the signal is high (a read access), the state machine transitions to the read state and reads data from the DRAM; otherwise it transitions to the write state and performs a write access. A complete DRAM read cycle follows the sequence idle → active → active delay → read → CAS latency → data output → idle. A complete DRAM write cycle follows the sequence idle → active → active delay → write → data write → write recovery → idle.
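The read and write sequences above can be modeled as simple ordered state lists. This is a behavioral sketch with illustrative state names (tRCD for the active delay, CL for the CAS latency, tWR for write recovery), not the RTL:

```python
# Behavioral sketch of the command state machine's read and write paths,
# following the transition sequences in Section 3.3. State names are
# hypothetical; real delay lengths come from the DRAM timing parameters.

READ_SEQ  = ["c_IDLE", "c_ACTIVE", "c_tRCD", "c_READ",
             "c_CL", "c_RDATA", "c_IDLE"]      # active delay, CAS latency, data out
WRITE_SEQ = ["c_IDLE", "c_ACTIVE", "c_tRCD", "c_WRITE",
             "c_WDATA", "c_tWR", "c_IDLE"]     # data write, write recovery

def next_state(state: str, is_read: bool) -> str:
    """Advance one step along the read or write sequence; the final
    transition returns the machine to idle for the next access."""
    seq = READ_SEQ if is_read else WRITE_SEQ
    i = seq.index(state)
    return seq[min(i + 1, len(seq) - 1)]
```

In hardware, each delay state holds for the required number of cycles (Section 3.5) before the transition fires; the sketch abstracts that away.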
In the implemented VIM-1 embedded memory controller, DRAM address bit A[10] is held high, selecting auto-precharge mode, so DRAM precharge is hidden in the read and write command operations. The delay times involved in the command state machine's transitions are determined by the DRAM module's speed grade and delay parameters. The dashed lines between the c_ACTIVE and c_WRITE or c_READ states indicate that when the activation delay is less than one clock cycle, the DRAM transitions directly from the active state to the read/write state; the delay is hidden in the clock cycle of the state transition itself.

3.4 Refresh Cycle

DRAM must be refreshed periodically. The refresh cycle proceeds as follows:
(1) After each refresh interval, the refresh module sends the refresh request signal sys_REF_REQ to the main control module;
(2) The control module of the memory controller acknowledges the request with the signal sys_REF_ACK;
(3) The acknowledgment remains valid throughout the refresh phase. Once sys_REF_REQ is asserted, it stays high until acknowledged by sys_REF_ACK;
(4) While sys_REF_ACK is valid, read and write accesses are not allowed until the refresh cycle completes; during the refresh, all commands on the system interface are ignored;
(5) After receiving the refresh request, the command state machine completes DRAM refresh control through the transitions c_AR (refresh state) → c_tRFC (refresh wait state) → c_idle (idle state), and then waits for system memory-access commands to begin the next read/write cycle.

3.5 Timing Control

Timing control is a key part of the memory controller implementation. In the VIM-1 embedded memory controller, a timer submodule implements the timing control of the internal state machines.
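A minimal behavioral sketch of this counter-based timing control, using the controller's ClkCNT counter and Reset_ClkCNT signal (the Python class structure itself is illustrative, not the RTL):

```python
# Sketch of the timer submodule: a clock-cycle counter is cleared when a
# delay state is entered and the state machine is released once the counter
# reaches the programmed delay. Signal names follow the paper; the rest is
# an illustrative model.

class DelayTimer:
    def __init__(self) -> None:
        self.clk_cnt = 0        # ClkCNT: increments every clock cycle
        self.reset = True       # Reset_ClkCNT: holds the counter cleared

    def enter_delay_state(self) -> None:
        """Entering a delay state clears the counter and starts counting."""
        self.clk_cnt = 0
        self.reset = False

    def tick(self, delay_cycles: int) -> bool:
        """Advance one clock; return True when the delay has elapsed."""
        if self.reset:
            return False
        self.clk_cnt += 1
        if self.clk_cnt >= delay_cycles:
            self.reset = True   # stop and clear, ready for the next delay
            self.clk_cnt = 0
            return True
        return False
```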
The timer submodule comprises a clock-cycle counter ClkCNT and a timing reset signal Reset_ClkCNT. Timing control works as follows:
(1) ClkCNT increments by 1 every clock cycle until the Reset_ClkCNT signal is set to 1, which clears ClkCNT to 0;
(2) When the state machine transitions to a state requiring a delay, Reset_ClkCNT is set to 1 and ClkCNT is cleared to 0;
(3) The state machine enters the delay-wait state and Reset_ClkCNT is set to 0; the timer then counts up from 0, incrementing ClkCNT every clock cycle;
(4) When ClkCNT reaches the specified delay, the relevant state machine submodule changes state; Reset_ClkCNT is set to 1 and ClkCNT is cleared to 0, ready for the timing control of the next state.

4 Functional Simulation and Synthesis Verification

4.1 Functional Simulation of the Memory Controller

The VIM-1 embedded memory controller was functionally simulated in ModelSim 5.7. The simulation results for read and write accesses are given below. Figure 5 shows the timing waveform of the controller's write cycle, which reflects the process of writing the data 00000009H (H denotes hexadecimal) from the data bus into the DRAM. As shown in Figure 5, when the command state machine cState is in state 0110, at 240 ns it drives the data onto the output data line sdr_odq, with value 00000009H. After the data write completes, the state machine transitions to the write-recovery state 0111 (260 ns) and, after two cycles, returns to the idle state 0000 at 280 ns, completing the write cycle. Figure 6 shows the timing waveform of the controller's read cycle.
Like the write-cycle waveform, the figure correctly reflects the state-transition sequence of the command state machine after it detects a read command on the control bus, the delays between state transitions, the DRAM command signals corresponding to each state, the values on the system address and data buses, and the values on the DRAM address and data lines. Through a series of state transitions, when the state machine cState reaches the data-output state 1010 (350 ns), the memory controller reads the data 00000009H previously stored in the DRAM and transfers it to the system data bus sys_odata.

4.2 FPGA Synthesis Verification

The VIM-1 embedded memory controller was synthesized in the Quartus II environment targeting Altera's Stratix-series FPGA EP1S80. The synthesis results of the memory controller are shown below.

5 Conclusion

The VIM system architecture, which embeds vector processing logic in a PIM, can fully exploit the high bandwidth, low latency, and low power of PIM, and can effectively relieve the memory performance bottleneck. The VIM-based embedded memory controller is the memory control component of the VIM system and is crucial to system performance. The VIM-1 embedded memory controller implemented in this paper supports multiple read/write modes and enforces strict timing control. Each memory controller serves a separate memory module and connects to the VIM memory crossbar interface, enabling multiple functional units to access multiple memory banks simultaneously. This design has significant research and application value.