
How to design smarter edge AI


Why do we like big numbers?

As an engineer with over 40 years of experience, including as a semiconductor business R&D director and CMO, I like to believe that my colleagues and I act logically. Yet how many of us can honestly say we have never been tempted by claims like "my product is faster than yours"? Perhaps that's just human nature.

The question is always one of definitions: what exactly do we mean by "faster," "lower power," or "cheaper"? This is the problem benchmarking tries to solve: establishing a consistent context and external standards so that you are comparing like with like. Anyone who works with benchmarks understands this well (aiMotive itself grew out of a leading GPU benchmarking company, for example).

Nowhere is addressing this need more urgent than when comparing hardware platforms for automotive AI applications.

When is 10 TOPS not 10 TOPS?

Whether or not they include a dedicated NPU, most SoCs quote their capacity to execute neural network workloads in TOPS: tera operations per second. This is simply the total number of arithmetic operations the hardware can perform per second, whether that compute is concentrated in a dedicated NPU or distributed across multiple engines (such as GPUs, CPU vector coprocessors, or other accelerators).
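As a back-of-envelope illustration (with made-up hardware numbers), a headline TOPS rating is typically derived from the datasheet like this, counting each multiply-accumulate as two operations:

```python
# Rough illustration (made-up hardware numbers): how a headline TOPS
# rating is typically derived, counting each MAC unit as two operations
# (one multiply plus one add) per clock cycle.

mac_units = 2048     # hypothetical number of parallel MAC units
clock_hz = 1.0e9     # hypothetical 1 GHz clock
ops_per_mac = 2      # multiply + accumulate counted as separate ops

peak_tops = mac_units * ops_per_mac * clock_hz / 1e12
print(f"Headline rating: {peak_tops:.1f} TOPS")  # -> 4.1 TOPS
```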

However, no hardware execution engine can run every workload at 100% efficiency. In neural network inference, certain layers (such as pooling or activation) are mathematically very different from convolutions. Before a convolution or pooling layer can begin, data often has to be rearranged or moved from one memory to another. At other times, the NPU may sit idle waiting for new instructions or data from the host CPU that controls it, sometimes for every layer or even every block of data. All of this reduces the number of useful computations actually performed, pushing real throughput well below the theoretical maximum.
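A toy model makes the effect concrete. The layer names, cycle counts, and stall fractions below are invented purely for illustration:

```python
# Toy model (invented numbers): per-layer cycles spent on useful math
# versus cycles stalled on data rearrangement or waiting for the host.

layers = [
    # (name, useful compute cycles, stall cycles)
    ("conv1",      1_000_000, 400_000),
    ("pool1",         50_000, 300_000),  # little math, lots of data movement
    ("conv2",      2_000_000, 600_000),
    ("activation",    20_000, 150_000),
]

compute = sum(c for _, c, _ in layers)
total = sum(c + s for _, c, s in layers)
print(f"Cycles doing useful math: {compute / total:.0%}")  # ~68% here
```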

Hardware utilization – not what it looks like

Many NPU vendors quote hardware utilization to indicate how well their NPU performs on a given neural network workload. It essentially says: "this is how much of my NPU's theoretical capacity is used when executing this workload." Surely that tells me what I need to know?

Unfortunately, no. The problem with hardware utilization is that it can be defined in many ways: the quoted figure depends entirely on the definition the NPU vendor chooses. Indeed, the deeper problem with both hardware utilization and TOPS is that they only tell you what the hardware engine is theoretically capable of, not what it actually delivers on real workloads.
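To see how slippery the term is, consider one hypothetical inference run measured against two equally plausible definitions of "utilization" (all counter values below are invented):

```python
# Hypothetical counters from a single inference run, showing how two
# equally plausible definitions of "hardware utilization" diverge.

peak_ops_per_cycle = 4096   # assumed MAC-array width x 2 ops per MAC
total_cycles = 1_000_000
busy_cycles = 800_000       # cycles the MAC array was active at all
ops_executed = 1.2e9        # operations the network actually required

# Definition A: fraction of cycles the engine was "busy"
util_busy = busy_cycles / total_cycles

# Definition B: fraction of peak operations actually performed
util_ops = ops_executed / (peak_ops_per_cycle * total_cycles)

print(f"Utilization (busy cycles): {util_busy:.0%}")  # 80%
print(f"Utilization (real ops):    {util_ops:.0%}")   # 29%
```

Both numbers describe the same run, yet a vendor quoting Definition A looks nearly three times better than one quoting Definition B.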

This can produce some seriously misleading comparisons. Figure 1 below shows our comparison of the 4 TOPS aiWare3P NPU against another well-known NPU rated at 8 TOPS.

Figure 1: Comparison of utilization and efficiency of two automotive inference NPUs (Source: aiMotive using publicly available hardware and software tools)

On two different well-known benchmark networks, the other NPU's claimed 8 TOPS, double the aiWare3P's 4 TOPS, should have translated into roughly twice the FPS. In reality, the opposite was true: the aiWare3P delivered 2 to 5 times the performance, despite having half the claimed TOPS!
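A quick back-of-envelope check shows what this implies. The TOPS ratings and the 2x FPS case come from the comparison above; the per-frame operation count is a placeholder, and it cancels out of the ratio between the two results anyway:

```python
# Back-of-envelope check of the comparison above. TOPS ratings and the
# 2x FPS case come from the article; ops_per_frame is a placeholder
# that cancels out of the ratio between the two results.

ops_per_frame = 1e9                           # placeholder: ops per frame
claimed_tops = {"aiWare3P": 4e12, "other NPU": 8e12}
fps = {"aiWare3P": 100.0, "other NPU": 50.0}  # illustrative 2x case

for name in claimed_tops:
    efficiency = ops_per_frame * fps[name] / claimed_tops[name]
    print(f"{name}: {efficiency:.2%} of claimed TOPS delivered")

# With half the claimed TOPS but 2x the FPS, aiWare3P's efficiency is
# 4x the other NPU's; at 5x the FPS the gap widens to 10x.
```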

The conclusion: TOPS is a very poor measure of AI hardware capability, and hardware utilization is almost as misleading.

NPU Efficiency and Autonomy: Key to Optimizing PPA

This is why I believe you must evaluate an NPU's capabilities based on its efficiency when executing a representative set of workloads, rather than on raw, theoretical hardware numbers. Efficiency is the percentage of the claimed TOPS actually delivered when executing a particular CNN: the operation count per frame is derived entirely from the underlying mathematics that defines the CNN, regardless of how the NPU evaluates it, and is combined with the measured frame rate. It compares actual performance against claimed performance, which is what truly matters.
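A minimal sketch of this metric, assuming the definition above (the function and variable names are mine, and the example numbers are illustrative):

```python
def efficiency(ops_per_frame: float, measured_fps: float, claimed_tops: float) -> float:
    """Fraction of claimed TOPS actually delivered on a given CNN.

    ops_per_frame comes purely from the mathematics of the network
    (every multiply and add a frame requires), independent of how the
    NPU schedules it; measured_fps is the benchmarked frame rate.
    """
    delivered_ops_per_second = ops_per_frame * measured_fps
    return delivered_ops_per_second / (claimed_tops * 1e12)

# Illustrative numbers only: a 2.3 GOP network at 120 FPS on a "4 TOPS" NPU
print(f"{efficiency(ops_per_frame=2.3e9, measured_fps=120, claimed_tops=4):.1%}")  # 6.9%
```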

A highly efficient NPU makes full use of every square millimeter of silicon used to implement it, which translates into lower chip cost and lower power consumption. Efficiency is therefore the key to achieving optimal PPA (performance, power, area) for automotive SoCs or ASICs.

The NPU's autonomy is another important factor. How much load does the NPU place on the host CPU to achieve its maximum performance? How heavily does it lean on the memory subsystem? The NPU is a large component of any SoC or ASIC, and its impact on the rest of the chip and its subsystems cannot be ignored.
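As a hypothetical illustration of why autonomy matters, compare an NPU that needs a host CPU interaction per layer with one dispatched once per frame (the layer count and frame rate below are invented):

```python
# Hypothetical illustration of autonomy: a host-driven NPU needs a CPU
# interaction per layer, while an autonomous NPU is dispatched once per
# frame and then executes the whole network from its own command stream.

layers_per_network = 120   # invented network depth
fps_target = 30            # invented frame rate

host_driven = layers_per_network * fps_target  # host interactions per second
autonomous = 1 * fps_target

print(f"Host-driven NPU: {host_driven} host interactions/s")  # 3600
print(f"Autonomous NPU:  {autonomous} host interactions/s")   # 30
```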

In conclusion

When designing any SoC or ASIC for automotive applications, AI engineers must focus on building a production platform capable of reliably executing their algorithms while achieving superior PPA: the lowest power consumption, lowest cost, and highest performance. They must also commit to a hardware platform early in the design cycle, typically before final algorithm development is complete.

Efficiency is the best way to achieve this goal; neither TOPS nor hardware utilization is a good metric. Assessing the NPU's autonomy is also crucial to meeting stringent production targets.

Tony King-Smith is an Executive Consultant at aiMotive. He has over 40 years of experience in the semiconductor and electronics industry, managing R&D strategy and hardware and software engineering teams for multinational companies such as Panasonic, Renesas, BAE Systems, and LSI Logic. Tony previously served as CMO of Imagination Technologies, a leading semiconductor IP provider.
