
What factors need to be considered when designing an AI chip?

2026-04-06

Number formats matter: Google is currently being sued for allegedly infringing on the rights of BF16's creators, with the plaintiff seeking between $1.6 billion and $5.2 billion in damages. All eyes are on number formats because they have played a significant role in improving the efficiency of AI hardware over the past decade: lower-precision formats have helped break down the memory walls faced by models with billions of parameters.

In this article, we will explore the technical aspects of neural network quantization, starting from the fundamental principles of number formats and working up to current techniques. We will cover floating point versus integer, circuit design considerations, block floating point, MSFP, microscaling formats, and logarithmic number systems. We will also discuss the difference between quantization and number formats for inference, as well as high-precision versus low-precision training methods. Finally, we will discuss next steps for models facing challenges with quantization and accuracy loss.

The above are all factors that need to be considered when designing an AI chip.

01. Matrix Multiplication

Most of the compute in any modern machine learning model goes into matrix multiplication. In GPT-3, each layer performs a large number of matrix multiplications: one specific operation, for example, multiplies a (2048 x 12288) matrix by a (12288 x 49152) matrix, producing a (2048 x 49152) output matrix.

The key question is how to compute each individual element of the output matrix, which boils down to the dot product of two very long vectors – in the example above, of length 12288. That is 12288 multiplications and 12287 additions, accumulated into a single number – one element of the output matrix.

Typically, hardware does this by initializing an accumulator register to zero and then, for each i, repeating two steps:

multiply x_i * w_i;

add the product to the accumulator;

at a throughput of one step per cycle. After roughly 12,288 cycles, the accumulation of a single output element is complete. This fused multiply-add (FMA) operation is the fundamental computational unit of machine learning: thousands of FMA units are strategically arranged on the chip to reuse data efficiently, computing many output elements in parallel and reducing the number of cycles required.
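The accumulate loop described above can be sketched in a few lines. This is an illustrative toy, not real hardware behavior; the function name is our own.

```python
def dot_fma(x, w):
    """Compute one output-matrix element as a chain of fused multiply-adds."""
    acc = 0.0                      # accumulator register initialized to zero
    for xi, wi in zip(x, w):
        acc += xi * wi             # one multiply + one accumulate per "cycle"
    return acc

# With length-12288 vectors, this loop body runs ~12,288 times per element.
print(dot_fma([1.0] * 4, [0.5] * 4))   # 2.0
```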

All the numbers in the above diagram need to be represented bitwise in some way within the chip:

x_i, the input activation;

w_i, the weight;

p_i, the pairwise product;

the partial sums accumulated before the final result;

the final output sum.

Within this vast design space, most current quantization research in machine learning reduces to two goals:

Store hundreds of billions of weights accurately while using as few bits as possible, reducing the memory footprint in both capacity and bandwidth. This depends on the number format used to store the weights.

Achieve good energy and area efficiency in compute, which depends primarily on the number format used for weights and activations.

These goals are sometimes aligned and sometimes in tension – we will examine both in more detail.

02. Number Format Design Goal 1: Chip Efficiency

The fundamental limit on the computational performance of most machine learning chips is power. While the H100 can theoretically achieve about 2,000 TFLOPS, it hits power limits before reaching that level – so FLOPs per joule is a critical metric to track. Given that modern training runs now often exceed 1e25 FLOPs, beating the state of the art (SOTA) requires extremely efficient chips drawing megawatts of power for months.

03. Basic Number Format

First, let's delve into the most fundamental number format in computing: integers.

I. Positive Integers

Positive integers have a unique base-2 representation. These are called UINTs, or unsigned integers. Here are some examples of 8-bit unsigned integers (also known as UINT8, ranging from 0 to 255).

These integers can have any number of digits, but typically only the following four formats are supported: UINT8, UINT16, UINT32, and UINT64.

II. Negative Integers

Negative integers require a sign to distinguish positive from negative. We can place an indicator in the most significant bit: for example, 0011 represents +3 and 1011 represents –3. This is called sign-magnitude representation. Note that because the first bit is the sign, the maximum value is effectively halved from 255 to 127. Sign-magnitude is intuitive but inefficient – your circuits must implement substantially different addition and subtraction algorithms from those used for unsigned integers. Hardware designers solve this with two's complement representation, which lets the exact same carry-adder circuitry be used for positive, negative, and unsigned numbers. All modern CPUs use two's complement.

In unsigned INT8, the largest number, 255, is 11111111. Adding 1 overflows to 00000000, which is 0. In signed INT8, the smallest number is –128 and the largest is 127. As a trick to let INT8 and UINT8 share hardware, –1 is represented as 11111111: incrementing it by 1 overflows to 00000000, which is exactly what we want for 0. Similarly, 11111110 represents –2.

Overflow is used as a feature! In effect, bit patterns 0 to 127 map to themselves, and 128 to 255 map directly to –128 to –1.
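The mapping can be checked in a few lines. This sketch interprets an 8-bit pattern both ways and shows that wrap-around addition serves signed and unsigned numbers alike; the helper name is our own.

```python
def to_signed8(bits):
    """Interpret an 8-bit pattern (0..255) as a two's-complement INT8."""
    return bits - 256 if bits >= 128 else bits

assert to_signed8(0b11111111) == -1    # 255 unsigned reads as -1 signed
assert to_signed8(0b11111110) == -2
# Incrementing -1 wraps to 0 using the very same adder as unsigned overflow:
assert to_signed8((0b11111111 + 1) & 0xFF) == 0
```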

04. Fixed Point

Going a step further, we can easily create new number formats on existing hardware without any modifications. The stored values are all integers, but we can simply treat them as multiples of something else! For example, 0.025 is just 25 thousandths, which can be stored as the integer 25 – we just have to remember that all numbers elsewhere are also in thousandths. This new "number format" can represent values from –0.128 to 0.127 in thousandth steps without any change to the logic: the number is still processed as an integer, with the decimal point fixed three places from the right. This strategy is called fixed point. More generally, it is a useful strategy we will revisit several times in this article – if you want to change the range of representable numbers, add a scaling factor somewhere. (You would of course do this in binary; decimal is just easier to discuss.)
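The thousandths example can be written out directly. A minimal sketch of fixed-point storage with an implied scale of 1/1000; the names are our own.

```python
SCALE = 1000                       # implied denominator: store thousandths

def to_fixed(x):
    return round(x * SCALE)        # 0.025 -> integer 25

def from_fixed(n):
    return n / SCALE

a, b = to_fixed(0.025), to_fixed(0.100)
assert a == 25
# Addition works directly on the stored integers, no new hardware needed:
assert from_fixed(a + b) == 0.125
```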

05. Floating Point

Fixed-point numbers have drawbacks, however, especially for multiplication. Suppose you need to multiply one trillion by one trillionth – that huge difference in magnitude is an example of high dynamic range. Both 10¹² and 10⁻¹² must be representable in our format, and it's easy to calculate how many bits that takes: counting from 0 to one trillion in trillionth increments requires 10²⁴ steps, so log₂(10²⁴) ≈ 80 bits to cover the dynamic range at the desired precision. 80 bits per number is clearly quite wasteful. In practice you don't care about absolute precision; you care about relative precision. So while the format above can distinguish 1,000,000,000,000 from 999,999,999,999.999999999999, you generally don't need it to – most of the time you care about error relative to the magnitude of the number. This is precisely the problem scientific notation solves: we can write one trillion as 1.00 × 10¹² and one trillionth as 1.00 × 10⁻¹², requiring far less storage. It's more complex, but it lets you represent extremely large and extremely small numbers in the same context without worrying about storage. So, in addition to a sign and a value, we now have an exponent. IEEE 754-1985 standardized how this is stored in binary across the industry (the formats in use before then varied slightly). The most familiar format, the 32-bit floating-point number ("float32" or "FP32"), can be described as (1, 8, 23): 1 sign bit, 8 exponent bits, and 23 mantissa bits.

A sign bit of 0 indicates positive, 1 indicates negative;

The exponent bits are read as an unsigned integer e, representing a scaling factor of 2^(e−127), ranging from 2^−126 to 2^127. More exponent bits mean a larger dynamic range;

The mantissa bits represent the value 1.m₁m₂…m₂₃, with an implicit leading 1. More mantissa bits mean higher relative precision;
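The (1, 8, 23) layout can be inspected directly with the standard-library struct module; a small sketch with names of our own choosing.

```python
import struct

def fp32_fields(x):
    """Split an FP32 value into its (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF          # biased by 127
    mantissa = bits & 0x7FFFFF              # implicit leading 1 not stored
    return sign, exponent, mantissa

# -1.5 = -(1 + 0.5) * 2^(127-127): sign 1, biased exponent 127, top mantissa bit set
assert fp32_fields(-1.5) == (1, 127, 1 << 22)
```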

Other bit widths have been standardized or are de facto adopted, such as FP16 (1,5,10) and BF16 (1,8,7). The central trade-off is range versus precision.

FP8 (1,5,2 or 1,4,3) was recently standardized in the OCP specification, with some additional quirks, but nothing is fully settled yet. Many AI hardware companies have shipped chips with slightly different variants that are incompatible with the standard.

06. Silicon Efficiency

Returning to hardware efficiency: the number format used has a significant impact on silicon area and power requirements.

I. Integer Circuits

Integer adders are among the most thoroughly studied problems in silicon design. While real implementations are far more complex, one way to think about an adder is as a chain of 1-bit additions with carries, so in a rough sense an n-bit adder does work proportional to n. For multiplication, recall long multiplication from elementary school: we form n partial products of the operand with each 1-bit digit and then sum them all at the end. In binary, multiplying by a 1-bit digit is trivial (the result is 0 or the operand itself). This means an n-bit multiplier is essentially n repetitions of an n-bit adder, so its work is proportional to n². While practical implementations vary considerably with area, power, and frequency constraints, in general 1) multipliers are much more expensive than adders, and 2) at lower bit widths (8 bits and below), the adder's relative contribution to an FMA's power and area cost grows (n versus n² scaling).
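The shift-and-add structure behind the n² scaling can be sketched directly: n one-bit partial products, each merged with an adder. Illustrative only; the function name is our own.

```python
def shift_add_multiply(a, b, n=8):
    """Binary long multiplication: n 1-bit partial products summed by an adder."""
    acc = 0
    for i in range(n):
        if (b >> i) & 1:           # multiplying by a 1-bit digit: keep or drop
            acc += a << i          # partial product shifted into position
    return acc

assert shift_add_multiply(13, 11) == 143
```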

II. Floating Point Circuits

Floating-point units are quite different. Multiplication, by contrast, is relatively simple: the output sign is negative if exactly one input sign is negative; the output exponent is the integer sum of the input exponents; the output mantissa is the integer product of the input mantissas. Addition, by comparison, is considerably more complex:

Compute the difference between the exponents (assume exp1 is at least as large as exp2 – if not, swap the operands);

Shift mantissa 2 down by (exp1 − exp2) bits to align it with mantissa 1;

Prepend the implicit leading 1 to each mantissa;

If one operand is negative, take the two's complement of its mantissa;

Add the mantissas to form the output mantissa;

If the sum overflows, increment the result exponent by 1 and shift the mantissa down;

If the result is negative, convert it back to an unsigned mantissa and set the output sign negative;

Normalize the mantissa to have a leading 1, then drop the implicit leading 1;

Round the mantissa appropriately (usually round-to-nearest-even).

Notably, floating-point multiplication is even "cheaper" than same-width integer multiplication, because the mantissa product involves fewer bits, while the adder for the exponents is far smaller than a multiplier – almost negligible. This is, of course, heavily simplified – in particular, subnormal and NaN handling consume significant area, and we haven't discussed them. The key point is that in low-bit floating point, multiplication is cheap and accumulation is expensive.

All the parts just mentioned are visible here – adding the exponents, a large array of multipliers for the mantissas, shifting and aligning as needed, then normalizing. (Technically, a true "fused" multiply-add differs slightly, but we'll omit that here.)
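The simple multiply side can be sketched over (sign, exponent, mantissa) triples: XOR the signs, add the exponents, multiply the mantissas. Normalization and rounding are omitted, so this is a toy under our own conventions, not a compliant implementation.

```python
def fp_multiply(a, b):
    """Toy floating-point multiply on (sign, exponent, integer mantissa) triples."""
    sign_a, exp_a, man_a = a
    sign_b, exp_b, man_b = b
    sign = sign_a ^ sign_b         # negative iff exactly one input is negative
    exp = exp_a + exp_b            # exponents pass through a small adder
    man = man_a * man_b            # mantissas pass through the multiplier
    return sign, exp, man

# 1.5 * 2^1 times 1.25 * 2^2 = 1.875 * 2^3 (= 15.0),
# with mantissas stored here as integers scaled by 4: 6/4 = 1.5, 5/4 = 1.25.
assert fp_multiply((0, 1, 6), (0, 2, 5)) == (0, 3, 30)
```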

This chart illustrates all the points above. There is a lot to digest, but the key takeaway is that INT8 × INT8 multiplication with accumulation into fixed point (FX) is the cheapest, with cost dominated by the multiplier ("mpy"), while designs using floating point for operands or accumulation are (often overwhelmingly) dominated by accumulation costs ("alignadd" + "normacc"). For example, significant savings come from pairing FP8 operands with a fixed-point accumulator instead of the usual FP32. In summary, this paper and others consistently find that an FP8 FMA occupies 40–50% more silicon area than an INT8 FMA, with an equal or worse energy penalty. This is the main reason most dedicated ML inference chips use INT8.

07. Number Format Design Goal 2: Accuracy

Since integers are always cheaper, why not use INT8 and INT16 everywhere instead of FP8 and FP16? It comes down to how accurately each format can represent the numbers that actually appear in neural networks. We can think of every number format as a lookup table. For example, a rather silly 2-bit number format might look like this:

Clearly, this set of four values is of little use, because too many numbers are missing – there are no negative numbers at all. If a number in the neural network isn't in the table, all you can do is round it to the nearest entry, which introduces a little error into the network. So what is the ideal set of values for the table? And how small can the table be? For example, if most values in the network are close to zero (which is in fact the case), we'd like many table entries close to zero, gaining accuracy there at the expense of accuracy elsewhere. What do the distributions look like in practice? Neural network values are typically normally or Laplace distributed, and sometimes, depending on the model architecture's exact numerics, there are large outliers. Very large language models in particular tend to have extreme outliers that are rare but important to the model's function.

The image above shows some of the weights of LLaMA 65B. They look very much like a normal distribution. Comparing this to the distributions of representable values in FP8 and INT8, it's clear that floating-point values are concentrated where it matters – close to zero. That's why we use them!

However, FP8 still doesn't quite match the true distribution – its density jumps at each power of two – though it's much better than INT8. Can we do better? One approach is to design the format from scratch to minimize the mean absolute error – the average loss introduced by rounding.
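The "format as lookup table" view makes mean absolute error easy to measure: round every value to its nearest entry and average the error. The Gaussian samples and the two toy tables below are illustrative, not real INT8/FP8 grids.

```python
import random

def mean_abs_error(values, table):
    """Average rounding error when every value snaps to its nearest table entry."""
    return sum(min(abs(v - t) for t in table) for v in values) / len(values)

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(10000)]

# Two toy tables: evenly spaced (integer-like) vs. signed powers of two
# (float-like, denser near zero). The MAE depends on how well the table
# matches the value distribution.
uniform_table = [i / 4 for i in range(-16, 17)]
power_table = sorted({s * 2.0 ** e for s in (-1, 1) for e in range(-8, 3)} | {0.0})

print(mean_abs_error(samples, uniform_table))
print(mean_abs_error(samples, power_table))
```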

08. Log Number Systems

For example, Nvidia argued at Hot Chips that log number systems are a possible way to keep scaling number formats below 8 bits. While logarithmic systems generally offer smaller rounding errors, they also bring several problems, including extremely expensive adders.
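The appeal and the catch of a log number system can both be seen in a few lines: storing log2|x| turns multiplication into plain addition, while addition would need an expensive log-domain correction (omitted here). Sign and zero handling are skipped, and the names are our own.

```python
import math

def lns_encode(x):
    return math.log2(x)            # store the log; sign/zero handling omitted

def lns_multiply(la, lb):
    return la + lb                 # multiply becomes a cheap add in log domain

def lns_decode(l):
    return 2.0 ** l

a, b = lns_encode(8.0), lns_encode(4.0)
assert lns_decode(lns_multiply(a, b)) == 32.0
# Addition, by contrast, needs log2(2^la + 2^lb) - a costly correction circuit.
```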

NF4 and its variant AF4 are 4-bit formats that assume the weights follow a perfect normal distribution and use an exact lookup table to minimize error. However, this approach is very expensive in area and power – each operation now requires table lookups, far worse than any INT/FP operation. Several alternative formats exist: posits, ELMA, PAL, and others. These claim various advantages in computational efficiency or representational accuracy, but none has yet reached commercially viable scale. Perhaps one of them, or one yet to be published or discovered, will combine the cost of INT with the representational accuracy of FP – some claim as much, or better. We personally find Lemurian Labs' PAL most promising, though much about the format remains undisclosed. They claim its 16-bit variant beats FP16 and BF16 in both precision and range while being cheaper in hardware.

Scaling down past 8 bits, PAL4 also claims a better-matched distribution than log number systems like the one Nvidia presented at Hot Chips. The claims on paper are impressive, but no hardware implementation of the format exists yet.

09. Block Number Formats

An interesting observation: an element's magnitude is almost always similar to that of nearby elements in the tensor. And when one element is much larger than usual, the nearby elements barely matter – they are relatively too small to register in the dot product. We can exploit this by sharing a single exponent across multiple elements instead of giving every number its own floating-point exponent, saving a lot of redundant exponent bits. The approach has existed for some time – Nervana Flexpoint, Microsoft MSFP12, Nvidia VSQ – culminating in OCP Microscaling (MX) in 2023. By then a whole family of possible formats existed, with different trade-offs. Microsoft has attempted to map out the design space for its hardware:
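Shared-exponent quantization can be sketched briefly: one exponent per block, small integer mantissas per element. This follows the spirit of MSFP/MX rather than any exact specification, and the parameter choices are our own.

```python
import math

def quantize_block(block, mantissa_bits=8):
    """Quantize a block of floats to one shared exponent + per-element integers."""
    amax = max(abs(v) for v in block)
    shared_exp = math.floor(math.log2(amax))        # one exponent for the block
    scale = 2.0 ** (shared_exp - (mantissa_bits - 2))
    return shared_exp, [round(v / scale) for v in block], scale

def dequantize_block(ints, scale):
    return [i * scale for i in ints]

exp, ints, scale = quantize_block([0.5, 1.0, -0.75, 2.0])
# Values exactly representable at this scale round-trip perfectly:
assert dequantize_block(ints, scale) == [0.5, 1.0, -0.75, 2.0]
```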

Hardware vendors face a thorny problem: trying to design highly specialized and efficient formats without closing the door to future model architectures that may have drastically different numerical distributions.

10. Inference

Much of the above applies to both inference and training, but each has its own complexities. Inference is especially cost- and power-sensitive, because a model is typically trained once but deployed to millions of customers. Training is also numerically more delicate, with many problematic operations (see below). As a result, inference chips are often far ahead of training chips in adopting smaller, cheaper number formats, so there can be a significant gap between the format a model is trained in and the format it is served in. Several families of tools exist to adapt a model from one format to another. On one hand, post-training quantization (PTQ) requires no actual training steps; it simply updates the weights according to some relatively simple algorithm.

The simplest approach is to round each weight to its nearest representable value. Beyond that:

`LLM.int8()` converts all but a small subset of outlier weights to INT8;

`GPTQ` uses second-order information about the weight matrix for better quantization;

`SmoothQuant` applies a mathematically equivalent transformation that smooths outlier activations;

`AWQ` uses information about the activations to quantize the most important weights more accurately;

`QuIP` preprocesses the model weights to make them less sensitive to quantization;

`AdaRound` optimizes each layer's rounding as a quadratic binary optimization problem;

and many more methods are constantly being released. Many "post-training" methods blur the line between training and quantization by iteratively optimizing the quantized model with a modified training step or surrogate objective. The key point is that these methods are drastically cheaper, but the real-world performance loss is often larger than the frequently touted simple benchmarks suggest.
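The round-to-nearest baseline at the top of that list can be sketched with a per-tensor scale chosen from the weights' absolute maximum. This is an illustrative sketch, not any specific library's implementation.

```python
def quantize_rtn(weights):
    """Round-to-nearest INT8 quantization with a per-tensor amax-based scale."""
    scale = max(abs(w) for w in weights) / 127.0    # map the amax to INT8 max
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.1, -0.05, 0.254, -0.13]
q, s = quantize_rtn(w)
assert all(-128 <= qi <= 127 for qi in q)
# Every reconstructed weight is within half a quantization step of the original:
assert max(abs(a - b) for a, b in zip(dequantize(q, s), w)) <= s / 2 + 1e-9
```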

On the other hand, quantization-aware training (QAT) switches the model to the new precision and continues training for a while so the model can adapt to it. All quantization approaches should use at least some of this mechanism to minimize real-world accuracy loss. QAT directly uses the regular training process to adapt the model to the quantization scheme and is generally considered more effective, but computationally more expensive.
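QAT implementations commonly rely on "fake quantization": the forward pass snaps weights to the low-precision grid while, in a real framework, the backward pass lets gradients flow straight through the rounding (the straight-through estimator). Below is only the forward side, as a sketch with names of our own.

```python
def fake_quantize(w, scale):
    """Forward pass of fake quantization: snap to the INT8 grid and back."""
    q = max(-128, min(127, round(w / scale)))   # quantize to the grid
    return q * scale                            # dequantize back to float

# In autograd terms (straight-through estimator), the backward pass treats
# d(fake_quantize)/dw as ~1 inside the clipping range and 0 outside it.
assert abs(fake_quantize(0.1234, 0.01) - 0.12) < 1e-9
```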

11. Training

Training is slightly more complex because of backpropagation: there are three matmuls per layer – one in the forward pass and two in the backward pass.

Each training step takes in weights, performs a series of matrix multiplications over batches of data points, and produces new weights. FP8 training is more complex still; below is a slightly simplified version of Nvidia's FP8 training recipe.

Some notable features of this recipe:

Each matmul takes FP8 × FP8 operands and accumulates into FP32 (actually somewhat lower precision, but Nvidia tells everyone it's FP32), then quantizes back to FP8 for the next layer. The accumulation must be more precise than FP8 because it applies tens of thousands of consecutive small updates to the same large accumulator, and each small update needs enough precision to avoid rounding down to zero;

Each FP8 weight tensor carries a scaling factor. Since the ranges of different layers can differ significantly, scaling each tensor to fit its layer's range is crucial;

Weight updates (outside the main loop) are very sensitive to precision and are typically kept in high precision (usually FP32). This again comes down to magnitude mismatch – weight updates are tiny compared to the weights, so precision is needed to keep the updates from rounding to zero;

Finally, a major difference between training and inference: gradients have more extreme outliers, and they matter. Activation gradients can be quantized to INT8 (e.g., SwitchBack, AQT), but weight gradients have so far resisted such efforts and must be kept in FP16 or FP8 (1,5,2).
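The per-tensor scaling factor mentioned above can be sketched in the spirit of amax-based FP8 scaling: choose a scale so the tensor's absolute maximum lands at the format's largest value. The E4M3 maximum of 448 comes from the OCP FP8 specification; everything else here is illustrative.

```python
FP8_E4M3_MAX = 448.0   # largest finite value of the OCP FP8 E4M3 format

def compute_scale(tensor):
    """Per-tensor scale mapping the absolute maximum onto the FP8 range."""
    amax = max(abs(v) for v in tensor)
    return FP8_E4M3_MAX / amax

grads = [0.001, -0.004, 0.002]
scale = compute_scale(grads)
scaled = [v * scale for v in grads]
# After scaling, the tensor exactly spans the representable FP8 range:
assert abs(max(abs(v) for v in scaled) - FP8_E4M3_MAX) < 1e-6
```

In practice, frameworks track amax over recent steps rather than recomputing it per tensor per step, but the idea is the same: each layer gets a scale matched to its own numeric range.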

