I. What is Artificial General Intelligence (AGI)?
Artificial General Intelligence (AGI) refers to a type of artificial intelligence whose intelligence is comparable to, or even surpasses, that of humans. AGI not only exhibits basic cognitive abilities such as perception, understanding, learning, and reasoning, as humans do, but also demonstrates flexible application, rapid learning, and creative thinking across a wide range of fields.
1 Overview of AGI Development
1.1 The concept of AGI
AGI (Artificial General Intelligence), also known as Strong AI, refers to intelligence that is equal to or even surpasses that of humans, and can exhibit all the intelligent behaviors of normal humans.
ChatGPT represents a qualitative leap in the development of large-scale models, possessing a certain level of AGI capability. With the success of ChatGPT, AGI has become a focal point of global competition.
In contrast to large-model-based AGI, traditional artificial intelligence built on small and medium-sized models can be called weak AI. It focuses on a specific business problem, uses models with small to medium parameter counts trained on small to medium datasets, and serves relatively deterministic, relatively simple application scenarios.
1.2 AGI Feature One: Emergence
Emergence is not a new concept. Kevin Kelly mentioned emergence in his book Out of Control, where it refers to the emergence of higher-level characteristics that transcend individual characteristics in a collection of individuals.
In the field of large models, "emergence" refers to the significant performance improvement and the emergence of amazing and unexpected capabilities when the model parameters exceed a certain scale, such as language understanding, generation, and logical reasoning.
For laypeople (the author included), emergent capabilities can be explained simply as "quantitative change leading to qualitative change": as model parameters keep increasing, they eventually cross a certain critical threshold, producing a qualitative change that gives the large model many powerful new capabilities.
For a detailed analysis of the "emergent" capabilities of large language models, please refer to Google's paper "Emergent Abilities of Large Language Models".
Of course, the development of large language models is still a very new field, and there are differing opinions on the concept of "emergent" capabilities. For example, researchers at Stanford University have questioned the claim of "emergent" capabilities in large language models, arguing that it is a result of artificially chosen measurement methods. See the paper "Are Emergent Abilities of Large Language Models a Mirage?" for details.
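The mirage argument above can be made concrete with a toy calculation (the numbers here are made up for illustration and are not from either paper): if per-token accuracy improves smoothly with scale, an all-or-nothing metric such as exact match on a multi-token answer can still make the improvement look like a sudden "emergent" jump.

```python
import math

def per_token_accuracy(log_params: float) -> float:
    """Hypothetical smooth improvement with scale (logistic in log10 of params)."""
    return 1.0 / (1.0 + math.exp(-(log_params - 9.0)))

def exact_match(log_params: float, answer_len: int = 20) -> float:
    """All tokens of the answer must be right: p ** 20 looks like a sharp jump."""
    return per_token_accuracy(log_params) ** answer_len

scales = [7, 8, 9, 10, 11]                    # log10 of parameter count
smooth = [per_token_accuracy(s) for s in scales]  # gradual, no surprise
sharp = [exact_match(s) for s in scales]          # near zero, then a jump
```

The underlying capability (`smooth`) climbs steadily, while the exact-match curve (`sharp`) stays near zero until the largest scale, where it suddenly becomes non-negligible. The choice of measurement, not the model, produces the apparent discontinuity.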
1.3 AGI Feature Two: Multimodality
Every source or form of information can be called a modality. For example, humans have touch, hearing, and vision; information media include text, images, voice, and video; and various types of sensors include cameras, radar, and lidar. Multimodal, as the name suggests, refers to expressing or perceiving things from multiple modalities. Multimodal machine learning refers to algorithms that learn and improve themselves from data from multiple modalities.
Traditional small- to medium-sized AI models are mostly single-modal. For example, there are algorithm models that specialize in single-modal tasks such as speech recognition, video analysis, image recognition, and text analysis.
Since the emergence of the Transformer-based ChatGPT, most subsequent large AI models have gradually added support for multimodal computing:
First, they can learn from multimodal data such as text, images, voice, and video;
Second, abilities learned in one modality can be applied to reasoning in another modality;
Third, capabilities learned from different modalities can be integrated to form new capabilities that go beyond what any single modality could provide.
The division of data into modalities is a human construct; the information carried by data from multiple modalities can be understood uniformly by AGI and transformed into model capability. With small and medium-sized models, we artificially fragment much of this information, thereby limiting the intelligence of the AI algorithms (in addition, the model's parameter count and architecture also have a significant impact on its intelligence).
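The "unified understanding" idea can be sketched as follows: every modality is reduced to a sequence of fixed-size vectors (tokens), so one model can consume text and image data alike. This is an illustrative toy, not a real model; all function names and the embedding scheme are hypothetical.

```python
EMBED_DIM = 4  # toy shared embedding width

def embed_text(words):
    # Hypothetical deterministic text embedding: character codes folded
    # into EMBED_DIM slots, one vector per word.
    out = []
    for w in words:
        v = [0.0] * EMBED_DIM
        for i, ch in enumerate(w):
            v[i % EMBED_DIM] += ord(ch) / 1000.0
        out.append(v)
    return out

def embed_image_patches(patches):
    # Each "patch" is a flat list of pixel values; pad or trim to EMBED_DIM
    # so image tokens live in the same vector space as text tokens.
    return [(p + [0.0] * EMBED_DIM)[:EMBED_DIM] for p in patches]

def to_token_sequence(text_tokens, image_tokens):
    # Tag each vector with its modality, yielding one uniform sequence
    # that a single model could consume.
    return ([("text", v) for v in text_tokens]
            + [("image", v) for v in image_tokens])

tokens = to_token_sequence(embed_text(["a", "cat"]),
                           embed_image_patches([[0.1, 0.2, 0.3, 0.4]]))
```

Once everything is a tagged vector sequence, the boundary between modalities disappears from the model's point of view, which is the premise behind cross-modal transfer described above.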
1.4 AGI Feature Three: Universality
Since deep learning entered our field of vision in 2012, AI models for specific application scenarios have sprung up like mushrooms after rain: license plate recognition, facial recognition, speech recognition, and more, as well as comprehensive scenarios such as autonomous driving and the metaverse. Each scenario has its own models, and even for the same scenario, many companies have developed models with different algorithms and architectures. AI models in this period were, in short, extremely fragmented.
Starting with GPT, people saw the dawn of general AI. The ideal AI model can take in training data of any form and in any scenario, can learn almost "all" capabilities, and can make any necessary decisions. Of course, most importantly, the intelligence capabilities of AGI based on large models far exceed those of traditional small and medium-sized AI models used for specific situations.
With the emergence of fully general-purpose AI, on the one hand, AGI can be generalized and deployed across all kinds of scenarios; on the other hand, as the algorithms converge, there is room for continuous optimization of the AI stack, allowing compute performance to keep improving. This continuous improvement in computing power, in turn, drives models to evolve toward ever larger parameter scales.
2 The relationship between specialized and general-purpose applications
Makimoto's Wave describes a development pattern in the electronics industry, similar in spirit to Moore's Law. It posits that integrated circuits cyclically swing between "general-purpose" and "dedicated" designs, with a cycle of approximately 10 years. Many in the chip industry therefore treat "general-purpose" and "dedicated" as two ends of the same scale, of equal standing: whether a product leans toward one or the other is a trade-off based on customer scenario requirements and implementation constraints.
However, looking at the development of AGI, the difference between AGI based on large models and traditional specialized artificial intelligence based on small and medium-sized models is not a simple trade-off between two equal ends, but rather a matter of upgrading from lower-level intelligence to higher-level intelligence. Let's re-examine the history of computing chip development from this perspective:
Application-Specific Integrated Circuits (ASICs) are a chip architecture that has existed throughout the development of integrated circuits.
Before the advent of CPUs, almost all chips were ASICs. Once the CPU appeared, it quickly gained dominance in the chip market: the CPU's ISA (instruction set architecture) contains only the most basic instructions, such as addition, subtraction, multiplication, and division, which is what makes the CPU a completely general-purpose processor.
GPUs were initially positioned as dedicated graphics processors; however, once they evolved into GPGPUs positioned as general parallel computing platforms, with CUDA supporting user development, GPUs achieved their dominant position in the heterogeneous-computing era.
As system complexity increases, and as customer systems diverge and iterate rapidly, pure ASIC chips become increasingly unsuitable. This has led to the rise of the DSA (Domain-Specific Architecture), which can be understood as a partial return from the ASIC toward general-purpose programmability. A DSA is essentially an ASIC with some programmability: while an ASIC is designed for a specific scenario with fixed business logic, a DSA addresses multiple scenarios within a single domain, with partially programmable business logic. Even so, in performance-sensitive domains like AI, AI-DSAs have not been as successful as GPUs. The fundamental reason is that AI workloads change rapidly, while AI-DSA chips have excessively long iteration cycles.
From a long-term perspective, application-specific integrated circuits pave the way for general-purpose processors: the general-purpose design extracts the more fundamental, universally applicable computational instructions or functions from the various application-specific computing methods and integrates them. For example:
CPUs are universal but offer relatively weak performance, so hardware acceleration is added through vector and tensor coprocessors.
Even with coprocessors, the CPU's acceleration capability is limited, which led to the GPU, a general-purpose parallel acceleration platform. But the GPU is still not the highest-performance acceleration approach, which is why Tensor Core acceleration emerged.
Tensor Cores still did not fully unleash computational performance, and so the fully independent DSA processor emerged.
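The generality-versus-performance trade-off in this progression can be illustrated with a toy model of vector (SIMD) execution. This sketch counts "instruction issues" rather than measuring real hardware time; the 4-lane width and both functions are hypothetical constructs for illustration only.

```python
LANES = 4  # assumed width of a toy vector unit

def scalar_add(a, b):
    """Scalar core: one add instruction issued per element."""
    issues = 0
    out = []
    for x, y in zip(a, b):
        out.append(x + y)
        issues += 1
    return out, issues

def vector_add(a, b):
    """Vector unit: one instruction handles up to LANES elements at once."""
    issues = 0
    out = []
    for i in range(0, len(a), LANES):
        out.extend(x + y for x, y in zip(a[i:i + LANES], b[i:i + LANES]))
        issues += 1
    return out, issues

a, b = list(range(16)), list(range(16))
s_out, s_issues = scalar_add(a, b)   # 16 instruction issues
v_out, v_issues = vector_add(a, b)   # 4 instruction issues, same result
```

The results are identical, but the vector path issues a quarter of the instructions; widening the unit (GPU warps, Tensor Cores, DSA pipelines) pushes this ratio further at the cost of flexibility, which is exactly the trade-off the progression above describes.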