Foreword
According to Huawei Global Industry Vision (GIV), the global volume of new data is projected to reach 180 ZB by 2025, far exceeding human processing capabilities, with 95% of this data relying on AI for processing. Data is a crucial asset for enterprises, and leveraging artificial intelligence for more efficient data analysis, processing, and decision-making to improve enterprise productivity and intelligence will become a core task for business operations. It is projected that by 2025, the global adoption rate of AI will reach 86%, and the rise of AI will profoundly change business models and value creation models.
Despite several ups and downs over the past 60 years, artificial intelligence has consistently achieved new breakthroughs driven by emerging information and communications technologies (ICT). In recent years, however, CPU performance has failed to double at the pace predicted by Moore's Law, leading to a widespread belief in the industry that Moore's Law is reaching its end. The ability to develop chips with the ultra-high computing power the market demands has become a crucial factor in the sustainable development of artificial intelligence.
Starting with AlphaGo's victory over Lee Sedol
In 2016, Google's AlphaGo and Go world champion Lee Sedol staged the "Match of the Century," pushing public attention to artificial intelligence to unprecedented heights. AlphaGo defeated the professional Go player Lee Sedol 4-1. The match consumed the computing resources of 1202 CPUs and 176 GPUs from Google's DeepMind, and AlphaGo's floating-point performance was more than 30,000 times that of IBM's Deep Blue when it defeated world chess champion Garry Kasparov in 1997.
(Figure 1: AlphaGo playing against Lee Sedol)
But from an energy efficiency perspective, did AlphaGo truly defeat humans? Let's look at the numbers. An adult male needs approximately 2550 kilocalories per day; at 1 kilocalorie (kcal) = 4.184 kilojoules (kJ), that is roughly 10.7 megajoules. In one hour of playing Go, Lee Sedol consumed approximately 0.7 megajoules. AlphaGo used 1202 CPUs and 176 GPUs to play against Lee Sedol. Assuming 100 W per CPU and 200 W per GPU, AlphaGo drew about 155.4 kW, or 559 megajoules per hour (1 watt-hour = 3600 joules). In other words, Lee Sedol's energy consumption was roughly one eight-hundredth of AlphaGo's.
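The arithmetic above can be checked in a few lines. The per-device power draws are the same assumptions the text uses, not measured figures:

```python
# Rough energy comparison between Lee Sedol and AlphaGo for one hour of play.
# Power assumptions follow the text: 100 W per CPU, 200 W per GPU.

KCAL_TO_KJ = 4.184

# Human side: ~2550 kcal/day, of which ~0.7 MJ spent in one hour of play
daily_intake_mj = 2550 * KCAL_TO_KJ / 1000    # ~10.7 MJ per day
human_mj_per_hour = 0.7

# Machine side: 1202 CPUs + 176 GPUs
power_w = 1202 * 100 + 176 * 200              # 155,400 W total draw
alphago_mj_per_hour = power_w * 3600 / 1e6    # ~559 MJ per hour

ratio = alphago_mj_per_hour / human_mj_per_hour
print(f"Daily intake: {daily_intake_mj:.1f} MJ")
print(f"AlphaGo: {alphago_mj_per_hour:.1f} MJ/h, ratio ~{ratio:.0f}x")
```

The ratio works out to roughly 800, matching the "one eight-hundredth" figure above.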
Afterwards, Google's DeepMind team improved the hardware, replacing the computing units with its custom TPU chips. At the same level of play, AlphaGo's energy consumption decreased by a factor of 12, but it still consumed 67 times more energy than a human.
Therefore, we see that while GPUs offer significant performance and efficiency improvements over CPUs, they are best suited to large-scale distributed training scenarios. With the development of 5G, IoT, cloud computing, and ultra-wideband information technologies, intelligence will extend to every smart device and terminal, including various forms of edge computing, IoT devices, and consumer smart terminals. To deliver the best user experience, these devices sit closest to the user and must offer long standby times, placing very strict demands on power consumption and physical size. Clearly, GPUs cannot meet the needs of these scenarios.
The essence of artificial intelligence is to help various industries improve productivity and generate social and commercial value. Relying on massive, expensive computing resources to handle a single scenario, as AlphaGo did, is a waste of resources. From our understanding of AI needs, AI chip development must consider all intelligent needs across all scenarios from the outset: cloud, edge, or terminal; deep learning training, inference, or both. It should not expect a single chip to solve everything. The development history of AI chips reflects this gradual adaptation.
Redefining AI Chips
Artificial intelligence chips have gone through a development process from CPU to GPU to FPGA to AI chip.
AlphaGo's first victory over a human was achieved through brute force, essentially on top of the von Neumann architecture. GPUs, with their powerful parallel and floating-point computing capabilities, subsequently became the standard for deep learning model training and inference. Compared to CPUs, GPUs process faster, require less server investment, and consume less power per unit of work, making them the mainstream platform for deep learning training in recent years.
However, GPUs cannot meet the computational needs of all deep learning scenarios. For example, Level 4 autonomous driving requires recognizing roads, pedestrians, traffic lights, and other conditions in real time. With CPU-based computation, the latency is far too high; the car might plunge into a river before the system even recognizes it. GPU computation can meet the latency requirements, but its high power consumption prevents a car battery from sustaining it for long. Furthermore, a single GPU card costs anywhere from tens of thousands to nearly one hundred thousand RMB, putting it out of reach for most ordinary consumers.
Essentially, GPUs are not ASICs specifically developed for AI algorithms. There is an urgent need to find chips that can solve the computing power problems of deep learning training and inference, as well as the problems of power consumption and cost. FPGA chips were born in this context.
An FPGA (Field-Programmable Gate Array) is a semi-custom circuit in the ASIC field. It is essentially an architectural innovation: an instruction-free, shared-memory-free design that meets the needs of specific scenarios.
FPGAs primarily improve performance, reduce latency, reduce power consumption, and lower costs through the following methods:
A large number of gate circuits and memory connections are defined by burning in configurable and rewritable FPGA configuration files.
By configuring the FPGA, it can be transformed into different processors to support a variety of deep learning computing tasks.
In an FPGA, registers and on-chip memory belong to their respective control logic, avoiding unnecessary arbitration and caching.
Research has found that for large-scale matrix operations, GPUs deliver significantly more computing power than FPGAs. However, the FPGA architecture is well suited to low-latency, streaming compute-intensive tasks. In cloud inference scenarios with massive concurrency, such as cloud speech recognition, FPGAs offer lower computational latency than GPUs, providing a better user experience.
However, FPGA chips essentially improve performance through pre-programming. AI, by contrast, often processes large volumes of unstructured data, such as video and images, which pre-programmed logic handles poorly. Instead, AI chips must support extensive sample training and inference interaction to form an algorithm model; only then can intelligent devices that integrate AI chips and algorithms possess intelligent reasoning capabilities.
While both GPUs and FPGAs can run AI algorithms, each has its shortcomings. GPUs are not ASICs designed for AI algorithms, so their power consumption and cost are high; FPGAs offer some architectural innovation, but their pre-programming is cumbersome. Strictly speaking, neither is an AI chip. So what is an AI chip? The data processing characteristics of deep learning algorithms demand chips with performance two to three orders of magnitude beyond traditional computing. Based on the above analysis, we attempt the following definition:
An AI chip is a dedicated chip based on an ASIC (Application-Specific Integrated Circuit) that can be flexibly defined and highly customized through software. On the one hand, it must be able to perform deep learning neural network operations; on the other, it must improve the efficiency of those operations through innovation in its hardware computing architecture, achieving optimal energy efficiency (TOPS/W). Only a chip that achieves both can be called an AI chip.
It is worth acknowledging that FPGAs boldly took the first step in the innovation of AI chip hardware architecture, namely the move toward the application-specific (ASIC) model.
AI chips rely on architectural innovation
As analyzed above, FPGAs consume less power than CPUs and GPUs essentially because of their instruction-free, shared-memory-free architecture. Before discussing architectural innovation, let's analyze why CPUs and GPUs cannot meet the demands of artificial intelligence.
Currently, most AI chips on the market adopt a CPU-like architecture (a partial optimization of the von Neumann architecture), which is essentially still a "computation-first" model: it improves processing performance by, for example, expanding the parallel computing units. However, training deep learning neural networks requires those computing units to perform frequent memory reads and writes. A CPU-like architecture is still a shared-memory model, so it cannot fundamentally solve the storage performance bottleneck of the von Neumann architecture, known as the "memory wall". A CPU-like architecture is illustrated below:
(Figure 2: CPU-like chip architecture)
Deep learning neural networks are characterized by high concurrency and high coupling in their data processing. Algorithm processing requires massive computation, extensive parallel processing, and low-latency operation. Take training as an example. Training involves storing large amounts of data, placing very high demands on memory capacity, memory access bandwidth, and memory management. It requires chips to provide a certain level of floating-point capability while supporting forward and backward computation across many iterations. In addition, training continuously adjusts the parameters (weights) of the neural network, involving repeated parameter reads and writes as well as complex data synchronization. These frequent parameter operations throughout training pose a significant challenge to the memory system.
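To make the memory pressure concrete, here is a back-of-the-envelope sketch for a hypothetical 100-million-parameter network trained in FP32 with an Adam-style optimizer. The model size and optimizer choice are illustrative assumptions, not tied to any particular chip:

```python
# Sketch of why training stresses memory: weights, gradients, and optimizer
# state must all be resident and are read/written on every training step.

params = 100_000_000      # hypothetical model size (assumption)
bytes_fp32 = 4            # bytes per FP32 value

weights_gb    = params * bytes_fp32 / 1e9      # the parameters themselves
gradients_gb  = params * bytes_fp32 / 1e9      # one gradient per parameter
adam_state_gb = 2 * params * bytes_fp32 / 1e9  # e.g. Adam keeps two moments

total_gb = weights_gb + gradients_gb + adam_state_gb
print(f"Resident training state: {total_gb:.1f} GB, touched every iteration")
```

Even before counting activations and input batches, gigabytes of state cycle through memory on every iteration, which is exactly the access pattern that saturates a shared-memory bus.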
Essentially, the von Neumann computing architecture is the root cause of the failure of Moore's Law in the context of artificial intelligence. Overcoming the "memory wall" bottleneck through hardware architecture innovation to achieve optimal deep learning algorithm computational efficiency has become the direction for innovation and development of AI chip architecture.
AI chip architecture design needs to meet the following requirements:
It meets the basic requirements of deep learning neural network operations, whether training, inference, or both in collaboration, and it must satisfy actual commercial scenarios in terms of data accuracy, scalability, extensibility, and power efficiency.
It supports "near-data computing," which uses hardware architecture design to shorten the distance between computation and storage, reducing the number of data moves and lowering energy consumption. For example, it supports performing neural network calculations on on-chip memory.
It supports flexible scaling and clustering, and supports large-scale distributed parallel AI training, for example by interconnecting parallel computing units through ultra-high-bandwidth networks.
It supports software-defined AI chips, meeting the customization and composition needs of most complex AI algorithms, and achieves economies of scale through widespread application, reducing the cost of AI chips.
Huawei Da Vinci AI Chip Architecture Introduction
In response to market trends and leveraging its years of chip R&D experience, Huawei launched the Da Vinci AI chip architecture in October 2018. Based on this architecture, Huawei released its full-stack, all-scenario AI solutions and the first batch of Ascend series chips. Notably, the Da Vinci architecture is designed specifically around the characteristics of AI computing, using a high-performance 3D Cube computing engine to significantly improve computing power and energy efficiency. Addressing the independent and collaborative AI needs of cloud, edge, and device, and covering AI scenarios from extremely low power consumption to extremely high computing power, it provides a unified underlying core for algorithm collaboration, migration, deployment, upgrade, and maintenance across cloud, edge, and device. This significantly lowers the barriers to AI algorithm development and iteration, and reduces the cost of deploying and commercializing enterprise AI. In short, the unified and scalable Da Vinci architecture underpins Huawei's comprehensive AI strategy, ensuring that AI is affordable, effective, and reliable across all scenarios.
The Da Vinci architecture is as follows:
(Figure 3: Huawei Da Vinci chip architecture)
Unlike the traditional von Neumann flow, where data is fetched from external memory and written back after processing, the Da Vinci architecture was designed from the outset to overcome the von Neumann "memory wall". Building on a CPU-like (computation-first) architecture, it further optimizes the storage hierarchy (storage-first). As shown in Figure 3, it expands parallel computing capability through multi-core stacking on one hand, and on the other designs on-chip memory (cache/buffer) to shorten the distance between Cube operations and storage, reducing accesses to DDR memory and alleviating the von Neumann bottleneck. Between computation and external storage, high-bandwidth memory (HBM) overcomes the access-speed limitations of shared-memory reads and writes. And to support larger-scale neural network training in the cloud, an ultra-high-bandwidth on-chip mesh network (LSU) interconnects multiple Cubes.
In summary, the da Vinci architecture has three main characteristics:
Unified architecture
Supports a full range of AI chips from tens of milliwatts to hundreds of watts. (See Figure 4)
Scalable computing
Each AI Core can perform 4096 MAC operations per clock cycle.
Flexible multi-core stacking; scalable Cube: 16×16×N, where N = 16/8/4/2/1
Supports multiple mixed-precision modes (INT8/INT32/FP16/FP32) to meet the data precision requirements of both training and inference scenarios.
It integrates multiple computational units, such as tensor, vector, and scalar units.
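The 4096 figure falls directly out of the Cube's 16×16×16 structure. The sketch below also estimates peak throughput; the core count (32) and clock rate (1 GHz) are illustrative assumptions for the calculation, not official specifications:

```python
# How a 16x16x16 Cube yields 4096 MACs per cycle, plus a rough peak estimate.
# Core count and clock rate below are assumptions for illustration only.

macs_per_cycle = 16 * 16 * 16   # one full-size Cube: 4096 MACs per cycle
flops_per_mac = 2               # a multiply-accumulate counts as 2 FLOPs

cores = 32                      # assumed number of AI Cores per chip
clock_hz = 1e9                  # assumed 1 GHz clock

peak_tflops = macs_per_cycle * flops_per_mac * cores * clock_hz / 1e12
print(f"{macs_per_cycle} MACs/cycle/core -> ~{peak_tflops:.0f} TFLOPS peak")
```

Under these assumptions the estimate lands at roughly 262 TFLOPS, the same order as the FP16 figure quoted for the Ascend 910 later in this article.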
Scalable memory
Dedicated and distributed, explicitly controlled memory distribution design
4 TByte/s L2 buffer cache
1.2 TByte/s HBM high-bandwidth memory
Scalable on-chip interconnects
On-chip ultra-high bandwidth mesh network (LSU)
Based on the innovative Da Vinci architecture, Huawei has launched its first Ascend chips: the Ascend 910 (7 nm) and Ascend 310 (12 nm). The Ascend 910 currently has the highest single-chip computing density in the world and supports large-scale distributed training in the cloud. A cluster of 1024 Ascend 910 chips would form the world's largest AI computing cluster to date, delivering 256 petaflops, enough to train even the most complex models.
The Ascend 310 chip is an AI SoC with high computing power and low power consumption, designed for edge inference scenarios.
Based on the Da Vinci architecture, Huawei has also planned Ascend chips (Figure 4: Nano, Tiny, Lite) for Bluetooth headsets, smartphones, and wearable devices. In the future, these chips will be integrated into other chips as IP to serve various smart products. AI chips on the market today typically fall into two categories, cloud training and edge inference; Huawei's Lite-class chips address AI application scenarios that demand extremely low power consumption.
Furthermore, the Da Vinci AI chip architecture takes software-defined AI chip capabilities into account. CANN (Compute Architecture for Neural Networks, shown in Figure 4) is a computing architecture customized for neural networks that includes a highly automated operator development tool for the chip. CANN can triple operator development efficiency, and it also optimizes operator performance to keep pace with the rapid development of AI applications.
(Figure 4: Huawei's full-stack, full-scenario AI architecture)
In terms of design, the Ascend chip series has broken through power and computing constraints to achieve a significant improvement in energy efficiency (see Figure 5). The Ascend 910's half-precision (FP16) computing power is 256 TFLOPS, twice that of NVIDIA's Tesla V100; its integer-precision (INT8) performance is 512 TOPS, with a maximum power consumption of only 350 W. The Ascend 310 focuses on efficient computing at low power, with 8 TFLOPS at half precision (FP16), 16 TOPS at integer precision (INT8), and a maximum power consumption of only 8 W. The Ascend 310's energy efficiency (TOPS/W) is more than twice that of NVIDIA's comparable Tesla P4 chip.
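The efficiency figures can be checked directly from the numbers quoted above:

```python
# Energy efficiency (TOPS/W) computed from the INT8 throughput and maximum
# power figures cited in the text for the two Ascend chips.
chips = {
    "Ascend 910": {"int8_tops": 512, "max_power_w": 350},
    "Ascend 310": {"int8_tops": 16,  "max_power_w": 8},
}

for name, c in chips.items():
    tops_per_watt = c["int8_tops"] / c["max_power_w"]
    print(f"{name}: {tops_per_watt:.2f} TOPS/W (INT8)")
```

Note that the small edge chip comes out more energy-efficient (2.0 TOPS/W vs. about 1.46 TOPS/W), consistent with the article's point that each scenario gets a chip optimized for its own power envelope.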
(Figure 5: Huawei Ascend series chips achieve optimal TOPS/W across all scenarios)
It should be noted that Huawei does not supply chips directly to third parties, so it does not compete directly with chip manufacturers. Huawei provides hardware and cloud services, and develops and sells chip-based AI acceleration modules, AI acceleration cards, AI servers, AI all-in-one machines, and MDC (Mobile Data Center) solutions for autonomous and intelligent driving.
Thinking behind the Da Vinci architecture
Unlike traditional information technology, the purpose of AI is to reduce enterprise production costs and improve efficiency. This means AI applications will transcend IT and penetrate deep into enterprise production systems, where they must integrate with various offline and local scenarios. This is why the Da Vinci architecture was designed from the outset to meet highly dynamic, wide-ranging AI needs.
However, Huawei's Da Vinci architecture has made only incremental innovations while standing on the shoulders of giants, and it still faces huge technical difficulties and challenges:
Although chip manufacturing processes have reached the nanometer scale, the further increases in integration density required by more demanding AI fields, such as brain-like computing, gene sequencing, and the development of new anti-cancer drugs, will run into atomic-scale leakage problems. This is precisely why industry giants are investing heavily in quantum physics.
While there is consensus on alleviating the von Neumann bottleneck, SRAM, currently the only practical on-chip memory tightly coupled to the computing core, offers capacities only in the megabyte range. Further innovation in memory device technology itself is still needed.
The storage-first model requires packaging multiple on-chip storage devices and managing them all, which further increases software complexity.
In the future, brain-like intelligence (the limiting case where AlphaGo consumes no more energy than a human) will require energy consumption several orders of magnitude lower than today's most advanced CMOS devices can offer.
Therefore, we believe that Huawei has made initial progress in the development of artificial intelligence chip technology, but the engineering challenges faced by AI chips and architecture design, especially neural network chips, are far from over.