ChatGPT and the Post-Moore Era
In late 2022, a landmark event occurred in the field of artificial intelligence: OpenAI released ChatGPT, a chatbot based on a large language model. The chatbot can respond to human instructions and perform a wide range of tasks, from writing articles and solving math problems to debugging code. Its release reshaped the public's understanding of AI and marked the commercialization of generative artificial intelligence, changing not only how AI research and development are conducted but also having a profound impact on society.
However, artificial intelligence is not a new technology. It originated in the 1950s and, over more than half a century of development, has gone through three waves: symbolism, connectionism, and behaviorism. Today it is generally held that artificial intelligence = deep learning + large-scale computing + big data. Deep learning is a special type of machine learning that requires vast amounts of data: it obtains model parameters through "training" and then uses the trained model to infer results. The more parameters a model has, the more computing power both training and inference require. As deep learning has advanced, demand for computing power in the AI field has been growing more than tenfold per year. Taking ChatGPT as an example, its initial version was based on the giant model GPT-3, with 175 billion parameters, while the latest version is based on GPT-4, which reportedly (according to online sources) has an astonishing 1.76 trillion parameters.
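To make the parameters-versus-compute relationship concrete: a widely used back-of-the-envelope rule from the scaling-law literature estimates training compute at roughly 6 FLOPs per parameter per training token. A minimal sketch (the ~300-billion-token figure for GPT-3 is a public estimate, not from this article):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Rough training-compute estimate: ~6 FLOPs per parameter per token
    (~2 for the forward pass, ~4 for the backward pass). A heuristic,
    not an exact operation count."""
    return 6.0 * n_params * n_tokens

# GPT-3 scale: 175B parameters, ~300B training tokens (public estimate)
flops = train_flops(175e9, 300e9)
print(f"{flops:.2e} FLOPs")  # on the order of 3e23
```

The quadratic-looking growth in required compute (more parameters, trained on more tokens) is what drives the 10x-per-year demand curve described above.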
Realizing artificial intelligence requires computing power, which in turn requires the support of chips; this is crucial for the development and industrialization of AI. Take GPT-3 as an example: with 175 billion parameters and a corpus of 100 billion words, training it would take roughly one month on 1,000 NVIDIA A100 GPUs. In 2023, a significant event occurred in the chip industry: on March 24, Gordon Moore, who proposed Moore's Law, passed away at the age of 94. Moore made his famous prediction about the development of integrated circuits in 1965: the number of transistors that can be placed on an integrated circuit roughly doubles every 18 to 24 months, meaning that processor performance roughly doubles every two years while the price halves. This is the renowned Moore's Law.
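The "1,000 A100s for one month" figure can be sanity-checked with rough arithmetic. The assumptions below (GPT-3-scale compute of ~3.15e23 FLOPs, ~312 TFLOPS peak BF16 per A100, ~30% sustained utilization) are illustrative estimates, not figures from this article:

```python
def training_days(total_flops: float, n_gpus: int,
                  peak_flops_per_gpu: float, utilization: float) -> float:
    """Days to train, given total compute and a sustained-utilization guess."""
    sustained = n_gpus * peak_flops_per_gpu * utilization  # FLOP/s
    return total_flops / sustained / 86_400                # seconds -> days

# GPT-3-scale compute on 1,000 A100s at an assumed 30% utilization
days = training_days(3.15e23, 1_000, 312e12, 0.30)
print(f"{days:.0f} days")  # roughly the "one month" figure in the text
```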
Although Moore's Law is not a formally established scientific law but rather Moore's generalization of observed trends, it successfully predicted the development of integrated circuits for half a century after it was proposed. Taking Intel as an example, from 1971 to 2008 the number of transistors on Intel microprocessor chips doubled every two years, while feature size shrank by about 15% annually, halving every five years. Benefiting from shrinking feature sizes, clock frequency could be significantly increased even with an unchanged hardware architecture. Again using Intel as an example, from 1990 to 2002 the clock frequency of its microprocessors doubled in under two years, although this also included gains from architectural upgrades.
If this trend had continued, processor clock frequencies would have reached 30GHz by 2008. However, in reality, Intel processor clock frequency growth gradually slowed after 2002, peaking in 2005. In November 2004, Intel announced the cancellation of its 4GHz Pentium processor plans, shifting its focus to multi-core architectures. Indeed, while Moore's Law has painted a bright future for integrated circuit development for over half a century, limitations imposed by physical effects, power consumption, and other factors mean it cannot continue indefinitely. Regarding physical effects, as process nodes shrink, transistor sizes approach atomic scales, and quantum and noise effects can affect transistor operation. For example, when gate lengths are sufficiently short, quantum tunneling occurs, leading to increased leakage current, as well as increased power consumption and temperature.
Furthermore, as the number of atoms in transistors decreases, factors such as impurity fluctuations, interface roughness, and lattice mismatch can cause performance differences between transistors. Regarding power consumption, as integration density increases, the number of transistors and clock frequency on a chip also increase, leading to more severe power consumption and heat dissipation issues. Power consumption mainly includes static power consumption and dynamic power consumption.
Static power consumption refers to the power consumed by leakage current in a transistor when it is off, and it is related to the quantum tunneling effect. Dynamic power consumption refers to the power consumed by a transistor during switching due to capacitor charging and discharging, and it is related to clock frequency and voltage. In addition, economic efficiency is also a factor to consider. As process nodes advance, the costs of equipment, materials, and labor required to manufacture chips continue to increase, which affects chip prices and market competitiveness.
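The dynamic-power description above corresponds to the classic CMOS formula P = α·C·V²·f. A small sketch showing why lowering supply voltage is so effective (the component values are illustrative only):

```python
def dynamic_power(alpha: float, c_load: float, v_dd: float, freq: float) -> float:
    """Classic CMOS dynamic-power model: P = alpha * C * V^2 * f, where
    alpha is the switching-activity factor, C the switched capacitance (F),
    V the supply voltage (V), and f the clock frequency (Hz)."""
    return alpha * c_load * v_dd**2 * freq

# Illustrative numbers: halving the voltage cuts dynamic power by 4x,
# because power scales with the *square* of the supply voltage.
p_high = dynamic_power(0.1, 1e-9, 1.0, 3e9)
p_low = dynamic_power(0.1, 1e-9, 0.5, 3e9)
print(p_high / p_low)  # 4.0
```

This quadratic dependence on voltage is why processors moved to lower voltages and more cores rather than ever-higher clock frequencies.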
More than a decade before Moore's death, the industry recognized that Moore's Law was slowing down and might eventually break down, giving rise to the concept of the post-Moore era and a search for new technological paths for integrated circuits. The industry has proposed four main directions: Continuing Moore, Extending Moore, Beyond Moore, and Much Moore. Because chip clock frequencies can no longer be pushed higher, processor design has shifted from single-core frequency scaling to multi-core parallel processing: multiple identical cores are provided, and computational tasks are distributed across them for simultaneous execution, thereby improving performance. However, as the scenarios and tasks processors face become increasingly complex, different tasks may run up against different performance and energy-efficiency constraints.
No single processor architecture is suitable for all scenarios. Therefore, the design of multi-core processors has gradually shifted from homogeneous to heterogeneous, meaning that the cores in the processor have different architectures, such as some being high-performance, some low-power, or some being general-purpose and some being specialized.
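The shift to heterogeneous cores means software must route each task to the core type that suits it. A toy sketch of that idea (the names and the dispatch policy are illustrative only, not any real scheduler's behavior):

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    latency_critical: bool  # does this task need a fast response?

def dispatch(tasks: list) -> tuple:
    """Route latency-critical work to high-performance cores and
    background work to low-power efficiency cores."""
    perf_queue, eff_queue = [], []
    for t in tasks:
        (perf_queue if t.latency_critical else eff_queue).append(t.name)
    return perf_queue, eff_queue

perf, eff = dispatch([Task("ui_render", True), Task("log_upload", False)])
print(perf, eff)  # ['ui_render'] ['log_upload']
```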
AI Chips in the Post-Moore Era
As mentioned earlier, AI applications exemplified by ChatGPT require immense computing power. Computing power, one of the three essential elements of artificial intelligence, in turn requires the support of AI chips. While broadly speaking any chip designed for AI applications can be called an AI chip, the term generally refers to a chip specifically designed to accelerate AI algorithms. Because deep learning demands high parallel-computing capability, and CPU architectures often cannot fully meet those requirements, dedicated chips suited to AI algorithms are needed.
Currently, common AI acceleration chips fall into three categories by technology: GPUs, FPGAs, and ASICs.
1) GPUs: composed of thousands of small, efficient cores forming a massively parallel architecture, suitable for high-volume parallel computation.
2) FPGAs: semi-custom chips that are highly flexible and well integrated, but offer lower raw compute and have high per-unit costs in mass production, suiting specialized fields where algorithms change frequently.
3) ASICs: domain-specific chips that are highly specialized, with long development cycles and very high design difficulty, suiting specialized fields with large market demand.
The table below compares the advantages and disadvantages of the three in more detail:
While CPUs cannot meet the performance requirements of AI algorithms and therefore cannot be used as dedicated AI chips, real-world AI applications actually require CPU participation. This is because CPUs possess general-purpose processing capabilities that other AI-specific chips lack. In AI applications, data preprocessing, flow control of the computation process, and post-processing of computation results all require the general-purpose processing capabilities of CPUs. As mentioned earlier, in the post-Moore's Law era, processor designs were primarily multi-core heterogeneous, with each processing unit fully utilizing its strengths and cooperating to efficiently complete computations. As a representative of post-Moore's Law chip design, AI processors naturally also need to adopt this heterogeneous multi-core design approach. Of course, different AI processors target different scenarios, and their specific heterogeneous designs also differ.
Taking edge AI processors as an example, the scenarios they address require low power consumption, high performance, and real-time data processing. Therefore, a traditional SoC design combined with a dedicated AI processor (ASIC) can be used. The CPU and peripherals in the SoC provide general processing and I/O interaction capabilities, while the dedicated AI processor accelerates AI algorithms. The combination of the two achieves a balance between high performance and low power consumption in AI computing scenarios. However, the drawback is that while dedicated AI processors offer high performance, they lack flexibility. The algorithms they support are determined at the time of design completion and cannot be flexibly added later. Furthermore, AI algorithms are evolving rapidly, with new operators emerging constantly, which may be difficult to handle relying solely on AI processors.
Adding an FPGA to this system greatly improves flexibility: if an unsupported algorithm or an unmet I/O performance requirement is encountered, it can be addressed through on-site custom development in the FPGA's programmable logic.
FPAI = FPGA + SoC + AI
As discussed above, for edge AI processors, a design combining an FPGA, an SoC, and a dedicated AI processor can balance versatility, flexibility, and energy efficiency. We might call this architecture FPAI, i.e., FPAI = FPGA + SoC + AI. While the architecture is attractive, integrating an FPGA makes actual design and manufacturing quite challenging. Fortunately, a Chinese manufacturer has taken the lead and launched an AI processor based on the FPAI architecture. The chip's architecture is shown in the following figure:
The chip mainly consists of the following three parts:
1) Processor System: corresponding to the SoC in the FPAI architecture, it mainly includes multi-core CPU/GPU/VPU processors, buses, storage units, and general-purpose interfaces.
2) AI Engine: corresponding to the dedicated AI processor in the FPAI architecture, it includes a Matrix Processing Engine (MPE), a Vector Processing Engine (VPE), on-chip storage, and several other computing engines. The MPE is mainly used for multiply-accumulate calculations, with a 32×32 MAC array as its main computing unit; the VPE handles vector calculations and operations such as activation and pooling; the on-chip storage caches intermediate data to relieve bandwidth pressure.
3) Programmable Logic: corresponding to the FPGA in the FPAI architecture, it includes programmable logic resources (BRAM, LUT, DSP), high-speed interfaces (GTH, ETH, PCIe), DDR, etc.
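To see how a fixed 32×32 MAC array can handle matrices far larger than itself, here is a minimal software sketch of tile-by-tile accumulation. It mimics only the tiling idea, not the chip's actual dataflow, and assumes dimensions divisible by the tile size:

```python
import numpy as np

TILE = 32  # matches the 32x32 MAC array size described above

def tiled_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Multiply by accumulating 32x32 tiles, the way a fixed-size MAC
    array processes a large matrix in multiple passes."""
    m, k = a.shape
    _, n = b.shape
    out = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, TILE):          # rows of the output tile
        for j in range(0, n, TILE):      # columns of the output tile
            for p in range(0, k, TILE):  # accumulate along the shared dim
                out[i:i+TILE, j:j+TILE] += (
                    a[i:i+TILE, p:p+TILE] @ b[p:p+TILE, j:j+TILE]
                )
    return out

a = np.random.rand(64, 96).astype(np.float32)
b = np.random.rand(96, 128).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-4)
```

The on-chip storage mentioned above plays the role of the intermediate accumulator here: partial tile results stay close to the MAC array instead of going out to DDR on every pass.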
This AI processor supports two computational precisions, INT8 and INT16, delivering 27.5 TOPS and 6.9 TOPS respectively. Running a YOLOv5s network takes 6.28 ms, with a detection accuracy of 0.568 in floating point, 0.547 with INT8 quantization, and 0.561 with INT16.
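The small accuracy gap between floating point and INT8/INT16 comes from quantization: values are mapped to integers through a scale factor, trading a little accuracy for much cheaper arithmetic. A generic symmetric-quantization sketch (this is a textbook scheme for illustration, not Icraft's actual calibration method):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: scale by max |x| so the
    whole tensor fits in [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(err)  # small but nonzero: the source of the accuracy drop
```

INT16 uses the same idea with a 32767 range, which is why its accuracy (0.561) sits closer to floating point than INT8's (0.547).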
The multi-core heterogeneous design of processors introduces significant programming complexity. A good AI processor therefore needs not only strong performance and energy efficiency but also a user-friendly compiler, so that upper-layer AI applications can be easily deployed and accelerated. The FPAI-architecture processor described above provides a powerful and flexible AI compiler, "Icraft," whose overall architecture is as follows:
Icraft mainly consists of the following components:
1) Front-end parsing: parses a model from an AI framework into Icraft's intermediate representation. Supported front-end frameworks: PyTorch, TensorFlow, ONNX, Caffe, and Darknet.
2) Quantization & Optimization: quantizes and optimizes the intermediate-layer network parsed from the framework, adapting it step by step to the AI processor.
3) Instruction Generation: converts operators into instruction sequences for the AI engine.
4) Simulation & Execution: simulates the intermediate-layer network or deploys the compiled network to the AI processor for execution.
5) Analysis & Evaluation: analyzes and evaluates the network's speed, efficiency, and so on, providing a reference for performance optimization.
Icraft provides strong support for the FPGA portion of the FPAI architecture. Users can program their own acceleration logic on the FPGA and add it to the compilation flow through Icraft's custom-operator interface. This lets users accelerate any operator through FPGA programming, flexibly meeting the needs of different scenarios. Due to space limitations, the specific custom-operator workflow will be covered in a later article.
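The five stages can be pictured as a simple pipeline. The sketch below only mirrors the stages described above; every function name is hypothetical and none of them come from the real Icraft toolchain:

```python
# Hypothetical sketch of the five-stage flow: parse -> quantize/optimize
# -> instruction generation -> simulate/execute -> analyze. Names are
# invented for illustration; the 6.28 ms figure is the one quoted earlier.
def parse(model):
    return {"graph": model, "ir": "middleware"}       # stage 1: front end

def quantize(ir):
    return {**ir, "precision": "int8"}                # stage 2: quantize

def codegen(ir):
    return [f"op_{i}" for i in range(3)]              # stage 3: instructions

def simulate(instrs):
    return {"latency_ms": 6.28}                       # stage 4: run/simulate

def profile(result):
    return f"latency: {result['latency_ms']} ms"      # stage 5: evaluate

ir = quantize(parse("yolov5s"))
report = profile(simulate(codegen(ir)))
print(report)  # latency: 6.28 ms
```

A custom FPGA operator would plug in between stages 2 and 3, replacing the generated instructions for that operator with a call into the user's programmable logic.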
Summary
Today we mainly discussed the importance of heterogeneous multi-core processor design in the post-Moore era, introduced the advantages of the FPAI (FPGA + SoC + AI) architecture for edge AI processor design, and described the hardware and software design of a commercially available accelerator based on the FPAI architecture. If you're interested in this FPAI chip, feel free to message me to discuss; I'll invite technical experts to answer your questions as soon as possible!