
A Guide to Understanding Chip Backend Reports

2026-04-06 07:38:48 · #1

First, I want to emphasize that I'm not a backend developer, but I frequently encounter discussions about PPA (Power, Performance, Area) with colleagues in marketing and chip development. At that point, the backend developers will typically present a table like this:

The image above shows a backend implementation result of A53, with the node being TSMC16FFLL+. Let's interpret it.

First, we need to understand that an ambitious mobile chip company doesn't have many foundries to choose from: TSMC, UMC, Samsung, GlobalFoundries (GF), and SMIC are about the only options. Also, starting this year, Intel's ICF (Intel Custom Foundry) is open to ARM processors; in fact, some have already started working with it, though not using third-party physical libraries. Typically, new processes are taken from TSMC, with UMC used when costs need to come down. GF has always been somewhat unconventional, so people hesitate to choose it for safety's sake; Samsung doesn't pay much attention to smaller customers, so few consider it either. As for SMIC, well, choosing it requires very high aspirations.

I won't go into the specifics of what 16nm means, as there are many explanations online. TSMC's 16nm process is further divided into several smaller nodes, such as FFLL+ and FFC. These nodes differ in their maximum frequency, leakage current, and cost, making them suitable for different types of chips. For example, mobile phone chips prefer low leakage current and low cost, while server chips prefer high frequency, and so on.

Next, let's look at the first row of the table, Configuration. This is the easiest to understand. It uses a quad-core A53 processor, 32KB of L1 data cache, 1MB of L2 cache, and enables ECC and the encryption/decryption engine. These options have a significant impact on the area, and a smaller impact on the frequency and power consumption.

Next is the Performance target, the target frequency. Backend engineers refer to frequency as Performance. When implementing the backend, a primary parameter must be selected from frequency, power consumption, and area (PPA) as the main optimization target. This table is specifically designed for high-performance A53 processors. The higher the frequency, the larger the area and leakage current will be, which is unavoidable. I will post a report on low-power, small-area processors later for comparison.

Below is CurrentPerformance, the frequency currently achieved. What does TT/0.9V/85C mean? Transistor characteristics cannot be identical at every point on a wafer, and different voltages and temperatures also change behavior. These conditions are categorized as PVT (Process, Voltage, Temperature), corresponding to TT/0.9V/85C respectively. The process axis has many corners, roughly following a normal distribution; TT (typical-typical) is just one of them. Ordered by transistor speed, there are also SS, TT, FF, and corners in between. Typically, the backend results require a signoff condition (usually SSG, the slow-slow global corner, in our case). This condition acts as a screening threshold for chips coming out of the fab: dies below it are rejected as unable to reach the required frequency. Therefore, the lower the condition is set, the higher the yield. However, the condition cannot be set too low, otherwise the backend becomes difficult to implement, or the optimization may fail to converge and produce no result at all. In the x86 world, a chip's "silicon quality" (binning) refers to exactly this PVT behavior.
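The signoff logic above can be sketched in miniature: a die ships only if it meets the target frequency at the worst-case (SSG) corner. The corner results and target below are invented for illustration, not taken from the table.

```python
# Sign-off screening in miniature: a die ships only if its worst-case
# (SSG) corner result meets the target frequency. Numbers are hypothetical.
signoff_corner = "SSG"
target_mhz = 2800

measured = {"TT": 3100, "SSG": 2850}   # hypothetical per-corner results
ships = measured[signoff_corner] >= target_mhz
print("ships" if ships else "binned down or rejected")
```

Lowering `target_mhz` raises yield, exactly as the text describes, at the cost of a harder (or infeasible) backend implementation.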

This column has four frequencies. The upper and lower pairs are easy to distinguish; they are simply different voltages. At a fixed frequency, dynamic power scales with the square of the voltage, which is common knowledge. The difference between the left and right numbers in each pair is the corner, namely TT and SSG respectively.
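The voltage-squared relationship is worth making concrete. A minimal sketch, using the classic CMOS dynamic power formula P = C·V²·f with an arbitrary capacitance value:

```python
# Dynamic power of CMOS logic: P = C * V^2 * f.
# At a fixed frequency, power scales with the square of the supply voltage.
def dynamic_power(c_eff, v, f):
    """Dynamic power in watts: effective capacitance (F), voltage (V), frequency (Hz)."""
    return c_eff * v**2 * f

# Same design, same 1 GHz clock, the two supply voltages from the table:
p_09 = dynamic_power(1e-9, 0.9, 1e9)
p_10 = dynamic_power(1e-9, 1.0, 1e9)
print(f"1.0V / 0.9V power ratio: {p_10 / p_09:.3f}")  # ~1.235
```

So moving from 0.9V to 1.0V at the same clock already costs about 23% more dynamic power.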

The next line is OptimizationPVT. Backend EDA tools essentially solve a large optimization problem: given a target, they converge on a local optimum. A value must be chosen between 1.0V and 0.9V as the most commonly used sweet spot for frequency, power consumption, and area. Here, 1.0V is chosen because its SSG (slow-slow global) corner result is closer to the target requirement. Dies that don't meet this corner can still be sold off cheaply at a lower clock.

The next line is leakage, i.e., static power consumption. Even when the CPU is idle, it still burns this power. It covers leakage from the logic within the four CPU cores and the L1 caches; the A53 core itself doesn't include the L2 cache. Other smaller logic blocks, such as the SCU (Snoop Control Unit), sit outside the CPU cores; they are called Non-CPU logic and are counted as part of the MP4 (quad-core) cluster. This is the power we see when the device is idle. The L2 and L3 caches can in principle be disabled via power gating, but usually they are not completely disabled, or cannot be disabled at all.

Below is Dynamic Power, the dynamic power consumption. Whenever I've seen CPUs measure dynamic power, they're running Dhrystone. Dhrystone is a very old benchmark, essentially performing string copying, and is easily optimized away by software, compilers, and hardware; as a performance metric it survives mainly for MCUs. However, it has an advantage: the program and its data are small enough to fit entirely in the L1 cache (if available), leaving only leakage current in the L2 cache and everything behind it. Although accessing L2, L3, or even DDR consumes more energy than accessing the L1 cache, their latency is also higher, potentially stalling the CPU pipeline. The consequence is that Dhrystone maximizes the power consumption of the CPU core logic, exceeding the power of programs that access caches beyond L1. Therefore, Dhrystone is usually used as a metric for maximum CPU power consumption. In reality, it's possible to write programs that consume even more power than Dhrystone, called MaxPowerVectors, which are used for SoC power estimation.
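Dhrystone scores are conventionally normalized to DMIPS, dividing raw Dhrystones per second by 1757 (the score of the VAX 11/780, the nominal 1-MIPS reference machine). A small sketch; the raw score below is an invented A53-class number for illustration:

```python
# DMIPS = Dhrystones/second / 1757 (VAX 11/780 reference score).
# DMIPS/MHz additionally normalizes out the clock frequency.
def dmips_per_mhz(dhrystones_per_sec, clock_mhz):
    return dhrystones_per_sec / 1757 / clock_mhz

# Hypothetical raw score at 1.5 GHz; assumed numbers for illustration only.
score = dmips_per_mhz(6_500_000, 1500)
print(f"{score:.2f} DMIPS/MHz")
```

DMIPS/MHz is the per-clock figure usually quoted in marketing material, which is exactly why it pairs naturally with the per-corner frequency numbers in the report.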

Dynamic power consumption is strongly correlated with voltage. The formula contains a voltage-squared term, and frequency itself also scales with voltage, so across voltage levels power grows roughly with the cube of the voltage. Therefore, although 1.0V may seem like only a 39% increase over 0.72V, the final dynamic power consumption can be nearly three times higher. At high frequencies, dynamic power accounts for the majority of total power consumption, so voltage should not be underestimated.
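The cube law above can be checked directly with the two voltages in the text:

```python
# If frequency is scaled together with voltage (f roughly proportional to V),
# dynamic power P = C * V^2 * f grows with the cube of the voltage.
v_low, v_high = 0.72, 1.0
voltage_increase = v_high / v_low - 1    # ~39% higher voltage
power_ratio = (v_high / v_low) ** 3      # ~2.7x the dynamic power
print(f"voltage +{voltage_increase:.0%}, dynamic power x{power_ratio:.2f}")
```

A 39% voltage bump thus lands close to the "three times higher" figure quoted above.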

In addition, dynamic power consumption is related to temperature. It is impossible for the temperature of the SoC to be maintained at 0 degrees when it is running, so the power consumption is usually calculated at 85 degrees or higher, which I won't go into detail about.

The next line is Area. Area is the foundation of a chip company and directly determines its gross margin. Therefore, given the performance requirements, the smaller the better, even at the expense of power consumption and higher voltage, hence the concept of OD (overdrive). There is data showing that on a 28nm process, the cost per square millimeter is currently approximately 10 cents. A very low-end mobile phone chip is at least 30mm² (used in phones costing around 200 yuan, which you've probably never even seen, and yes, it's a smartphone). The silicon cost alone is $3, not counting packaging, testing, storage, and transportation. Even ordinary low-end chips are at least 40mm² (in a 300 yuan phone). In the 600-700 yuan phones we commonly see, one-sixth of the cost is the mobile phone chip. On the other hand, there are also companies that are not short of money, such as Apple. The A10 is said to be about 125mm² on the 16nm process. Converted to the A53MP4 here, looking only at area and ignoring power consumption, there would be room for 120 A53s, which is extremely extravagant. That's with the A53 running at 2.8GHz; at 1.5GHz, 150 would fit.
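The cost arithmetic above is simple enough to write down directly, using the ~$0.10/mm² 28nm figure from the text:

```python
# Rough die-cost arithmetic: ~$0.10 per mm^2 on a 28nm process.
COST_PER_MM2 = 0.10

def die_cost(area_mm2):
    """Silicon cost only -- excludes packaging, test, storage, and shipping."""
    return area_mm2 * COST_PER_MM2

print(f"30 mm^2 chip: ${die_cost(30):.2f}")  # the very-low-end example
print(f"40 mm^2 chip: ${die_cost(40):.2f}")  # the ordinary low-end example
```

At these margins it is obvious why, for a fixed performance target, every fraction of a square millimeter matters.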

So what exactly does Apple do with all that space? First, modules like the GPU, Video, Display, baseband, and ISP can easily trade area for performance because they process data in parallel. Power consumption can also be traded against area: a simple approach is to lower the clock and increase the number of processing units. While this increases leakage, it allows a lower voltage, significantly reducing dynamic power. The exception is the CPU's single-core performance. Why can Apple achieve 1.8 times the performance of the Kirin 960 while keeping heat dissipation acceptable? This comes down to the physical library, backend, frontend, and software.

First, the A10 uses a 6-issue design, while the contemporary A73 is only 2-issue. Of course, due to data and instruction dependencies, the performance increase isn't threefold, and going to 6-issue makes area and power grow non-linearly. For comparison, I've seen ARM's 6-issue CPU model: on the same process, its per-clock single-core performance is 1.8 times that of the A73, its dynamic power is estimated at more than twice the A73's, and its area is also nearly double. Of course, its microarchitecture is quite different from the A73's. Such a core runs at 2.5GHz on a 16nm process with single-core power of approximately 1W. Since a mobile phone chip can sustain about 2.5W without throttling, Apple's 2.3GHz A10 is still feasible.

To control power consumption, additional logic is inserted during RTL (Register Transfer Level) design for clock gating, and this is done hierarchically: RTL level, module level, system level, and even the clock tree itself (in the SoCs I've seen, the clock tree typically accounts for one-third of total logic power). All of this increases the area by at least one-third. Then there's power gating, which is also hierarchical. The simplest approach assigns a switch to each cache block and another to each module. More sophisticated approaches predict, based on the instruction stream, which cache banks won't be used for a short period and shut them down directly. Power gating costs more latency than clock gating, and if switching is frequent it can be counterproductive, so it requires careful consideration. Furthermore, the more complex the design, the harder verification becomes, necessitating a balance. Besides clock domains, there are also power domains, which allow voltage to be adjusted per frequency. Of course, the more domains, the harder the routing and the larger the area.

Going further up, different power states can be defined, allowing upper-layer software to participate in power management and scheduling. I've elaborated on this in a more detailed answer: How to evaluate ARM's big.LITTLE core switching technology?

Returning to the Apple A10, it also uses 6MB of cache. This is shockingly large even in mobile phones. Typically, a high-end A73 with 2MB, or an A53 with 1MB, is considered very advanced, while low-end models don't exceed 1MB. I've done some experiments on the A53 using SPECINT 2K; increasing the L2 cache from 128KB to 1MB only increases performance by less than 15%. With 6MB, the performance/area gain is not linear; it's a blatant trade-off between area and performance. Moreover, Apple promotes GeekBench 4.0, not SPECINT, suggesting this benchmark might be more sensitive to cache size; I might conduct experiments later. Incidentally, AnTuTu 5.0 has absolutely no relation to cache size, which is quite disheartening for high-end mobile chip companies. It seems this changed in AnTuTu 6.0, but I haven't investigated it in detail yet. As for the leakage caused by the large cache area, there is a solution: partially disable the cache, enabling only what is needed. This is a delicate task requiring both hardware and software coordination.
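The diminishing returns from larger caches can be illustrated with a toy model. The sketch below assumes the classic "square-root rule" (miss rate roughly halves when cache size quadruples); this is a textbook approximation, not a fit to the SPECINT 2K measurement quoted above, which is real data.

```python
import math

# Toy model of diminishing cache returns under the square-root rule:
# miss_rate(size) proportional to 1/sqrt(size). Purely illustrative.
def relative_miss_rate(size_kb, base_kb=128):
    """Miss rate relative to a baseline cache of base_kb kilobytes."""
    return math.sqrt(base_kb / size_kb)

for size in (128, 256, 512, 1024, 2048, 6144):
    print(f"{size:5d} KB L2: relative miss rate x{relative_miss_rate(size):.2f}")
```

Going from 1MB to 6MB buys far less than the first few doublings did, which is exactly the "blatant trade-off between area and performance" the text describes.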

The factors affecting the area are not yet finished; the above is just the front end, and there are still a lot of considerations on the back end.

First, let's look at the bottom row of the table, Metal Stack. Chips are built up layer by layer during manufacturing, and each lithography step requires a mask so that only the intended regions are exposed to light. The 11m here indicates 11 metal layers. Transistors sit at the bottom, while the wiring runs above them. More layers make routing easier, as anyone familiar with PCB layout will understand. Logically you'd want more layers, but fabs charge per layer, so more layers mean more cost. Fewer layers not only make routing harder but also lower overall area utilization; for example, on the A53, achieving 80% utilization with 11 layers is considered quite good. Therefore, the total area of a chip isn't simply the sum of the areas of its modules; place and route (P&R) must also be considered to optimize utilization.
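The utilization point translates directly into arithmetic: placed-and-routed area is cell area divided by utilization. The 2 mm² cell-area figure below is invented for illustration; the 80% utilization is the figure from the text.

```python
# Post-P&R chip area = standard-cell/macro area / utilization.
def chip_area(cell_area_mm2, utilization):
    """Area after place-and-route; utilization is the fraction of the die
    actually occupied by cells (0.80 is a good result per the text)."""
    return cell_area_mm2 / utilization

# Hypothetical 2 mm^2 of cells at 80% vs. a poorer 65% utilization:
print(f"{chip_area(2.0, 0.80):.2f} mm^2 at 80% utilization")
print(f"{chip_area(2.0, 0.65):.2f} mm^2 at 65% utilization")
```

A drop in utilization inflates the die, and hence the cost, even though not a single gate was added.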

Looking at the next two rows of the table, Logic Architecture and Memory: logic and memory are the two main module categories of digital circuits. This memory refers to on-chip static RAM, not external DDR. What does uLVT mean? It stands for ultra-low threshold voltage (Vt), i.e., standard logic cells with an ultra-low threshold. A low threshold is good for speed, but the leakage of such cells is very high: leakage grows roughly exponentially as the threshold drops, so a tenfold increase in leakage buys only a modest gain in maximum frequency. The backend can therefore constrain the EDA tools, for example allowing uLVT only on the 1% of critical-path logic that needs the frequency boost, while using LVT, SVT, or HVT (threshold voltage rising, leakage falling, in that order) for everything else to keep overall leakage down.
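The exponential leakage-vs-threshold relationship can be sketched numerically. The model assumes a subthreshold swing of about 100 mV/decade (a common ballpark), and the Vt offsets for each cell flavor are invented for illustration; real HVT/SVT/LVT/uLVT offsets are process-specific and confidential.

```python
# Subthreshold leakage grows exponentially as the threshold voltage drops.
# With a ~100 mV/decade subthreshold swing, every 100 mV cut in Vt costs
# roughly 10x leakage. Vt offsets below are hypothetical.
def leakage_ratio(delta_vt_mv, swing_mv_per_decade=100):
    """Leakage relative to the reference cell, for a Vt reduced by delta_vt_mv."""
    return 10 ** (delta_vt_mv / swing_mv_per_decade)

for name, dvt in [("HVT", 0), ("SVT", 80), ("LVT", 160), ("uLVT", 240)]:
    print(f"{name:4s}: leakage x{leakage_ratio(dvt):8.1f}")
```

This is why restricting uLVT to ~1% of critical-path cells, as described above, is so effective: the exponential cost is confined to a sliver of the design.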

Beyond the threshold voltage, the backend can also choose the transistor's channel length. The shorter the channel, the larger the drive current and the higher the leakage, and correspondingly the higher the achievable frequency. Therefore we sometimes see cell names like uLVT C16 and LVT C24, where C refers to channel length.

Next is Memory, also written MemoryInstance, or sometimes FCI (FastCacheInstance). Accessing memory involves three important timing parameters: read, write, and setup. These can fall in the same or different clock cycles. For the L1 cache they generally share a single clock cycle, and this access cannot be pipelined. Starting with the A73, I've seen the backend's critical path get stuck on L1 cache access. In other words, the speed of this path determines the CPU's maximum frequency, and the L1 size determines the index width: the larger the index, the slower the access and the lower the frequency. Therefore high-end ARM CPUs keep the L1 cache at 64KB or less, which is closely tied to the backend. Of course, the benefit of enlarging the L1 also falls off non-linearly. The L2 and L3 caches behind it can use multi-cycle access or multi-bank interleaving, allowing sizes of hundreds of KB or even a few MB.

Logic and memory together are referred to as the physical library, which is designed on top of the Process Design Kit (PDK) provided by the foundry for each process node. The library is the lowest level a fabless chip company can reach, and the ability to build and maintain its own mature physical library is one of the hallmarks of a company with leading backend capabilities.

The last line, Margin, refers to the inevitable deviations that will occur during the factory's production process, and this line defines the range of these deviations. See the image below:

Blue indicates the distribution of the corners we just discussed, and red indicates production variation; test chips must be produced to calibrate these variations. SB-OCV stands for stage-based on-chip variation, which, together with several other variation terms, totals about ±7%, meaning a ±7% margin must be kept on top of the timing results determined at the end of the back-end design phase.

There are also some setup uncertainties (setup UC), which cover signal setup time, hold time, and the PLL jitter range.

This concludes the interpretation of one report. Now let's take a look at the corresponding low-power implementation:

Here the frequency drops to around 1.5GHz, the dynamic power per GHz falls by about 10%, and static power falls to 12.88mW, only a quarter of the high-performance implementation. We can see that LVT is used here but not uLVT, which is one reason static power can be so low. Since area is not the optimization target, it remains essentially unchanged, which makes sense: the channel length is the same and the logic area cannot shrink.

