
Current Status of Edge Generative AI Inference Technology

2026-04-06 05:16:56

In less than a year, generative artificial intelligence has gained global recognition and adoption through OpenAI's ChatGPT, a chatbot built on the popular transformer architecture. Transformer-based models learn complex relationships between the elements of an input (such as the words of a sentence or question) and use them to generate human-like dialogue.


Driven by transformers and other large language models (LLMs), software algorithms have advanced rapidly, but the processing hardware that executes these algorithms has lagged behind. Even the most advanced algorithm processors lack the necessary performance to complete the latest ChatGPT queries within a second or two.


To compensate for performance limitations, leading semiconductor companies build systems using a large number of the best hardware processors. To do this, they make trade-offs between power consumption, bandwidth/latency, and cost. This approach works well for algorithm training but not for inference deployed on edge devices.


Power consumption challenge


Training is typically performed in fp32 or fp64 floating-point arithmetic and generates enormous amounts of data, but its latency requirements are not stringent. It is, however, power-intensive and expensive.


The inference process is quite different. Inference is usually performed in fp8 arithmetic, which still generates a great deal of data, but it demands low latency, low power consumption, and low cost.


Today's solution for model training is the compute farm. Compute farms run for days at a time, consume large amounts of electricity, generate substantial heat, and are expensive to purchase, install, operate, and maintain. Worse still, inference faces its own obstacles, hindering the adoption of GenAI on edge devices.




Current Status of Edge Generative AI Inference Technology


Successful hardware accelerators for GenAI inference must satisfy five requirements:

- Petaflop-level processing power at high efficiency (above 50%).
- Low latency, delivering query responses within seconds.
- Power consumption of 50 W per petaflop or below.
- A cost low enough for edge applications.
- Field programmability, so that software updates or upgrades never require hardware modifications at the factory.


Most existing hardware accelerators meet some of these requirements, but none meets all of them. Older CPUs are the worst choice because their execution speed is unacceptably slow; GPUs are fairly fast but consume a great deal of power and suffer from high latency (which is why they are the preferred choice for training rather than inference); FPGAs compromise on both performance and latency.


The ideal device would be a custom/programmable system-on-a-chip (SoC) designed to execute transformer-based algorithms as well as other types of algorithms. It should support adequate memory capacity to store the massive amounts of data embedded in the LLM and be programmable to accommodate field upgrades.


There are two obstacles to achieving this goal: the memory wall and the high power consumption of CMOS devices.


Memory wall


In the early days of semiconductor development, it was observed that advancements in processor performance were offset by slow progress in memory access speeds.


Over time the gap between the two has continued to widen, forcing processors to wait ever longer for memory to deliver data. The result is that processor efficiency falls well below full 100% utilization.


To mitigate this efficiency loss, the industry designed multi-level memory hierarchies that place faster, more expensive memory technologies close to the processor as caches, minimizing the traffic that reaches main memory and even slower external storage.
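The benefit of such a hierarchy can be sketched with a simple average-memory-access-time (AMAT) model; the latencies and hit rates below are illustrative assumptions, not measured values:

```python
# Average memory access time (AMAT) for a two-level cache hierarchy.
# Latencies (ns) and hit rates are illustrative assumptions only.
L1_LATENCY, L2_LATENCY, DRAM_LATENCY = 1.0, 10.0, 100.0
L1_HIT, L2_HIT = 0.90, 0.95  # assumed hit rates at each level

def amat(l1_hit: float, l2_hit: float) -> float:
    """AMAT = L1 + miss(L1) * (L2 + miss(L2) * DRAM)."""
    return L1_LATENCY + (1 - l1_hit) * (L2_LATENCY + (1 - l2_hit) * DRAM_LATENCY)

# With caching, a typical access costs ~2.5 ns instead of ~111 ns.
with_cache = amat(L1_HIT, L2_HIT)
no_cache = amat(0.0, 0.0)
```

With these (assumed) numbers the hierarchy cuts average access time by roughly 40x, which is exactly the gap the memory wall reopens when working sets outgrow the caches.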


Energy consumption of CMOS ICs


Contrary to intuition, the power consumption of CMOS ICs is primarily due to data movement rather than data processing. Memory access consumes several orders of magnitude more power than basic digital logic computations, as demonstrated by a study led by Professor Mark Horowitz at Stanford University.


When performing integer operations, adders and multipliers consume less than 1 picojoule per operation, and floating-point units only a few picojoules. By contrast, accessing data in a cache costs an order of magnitude more, 20-100 picojoules, and accessing data in DRAM costs three orders of magnitude more, over 1,000 picojoules.
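A rough sketch of what these figures imply for a single operation whose operands come from DRAM; the exact picojoule values below are assumptions chosen within the ranges quoted above:

```python
# Energy budget of one floating-point multiply whose two operands are
# fetched from DRAM. Per-event energies (pJ) are assumptions within
# the Horowitz-style ranges quoted in the text.
ENERGY_PJ = {
    "fp_mul": 4.0,          # floating-point multiply: a few pJ
    "cache_access": 50.0,   # cache read: 20-100 pJ
    "dram_access": 1300.0,  # DRAM read: >1000 pJ
}

compute = ENERGY_PJ["fp_mul"]
movement = 2 * ENERGY_PJ["dram_access"]  # two operands from DRAM
movement_share = movement / (compute + movement)
```

Under these assumptions, data movement accounts for well over 99% of the energy of the operation, which is the point of the paragraph above: the arithmetic itself is almost free.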


GenAI accelerators are a prime example of designs whose cost is dominated by data movement.


The impact of memory wall and power consumption on latency and efficiency


The effects of memory walls and energy consumption in generative AI processing are becoming increasingly difficult to control.


In just a few years, ChatGPT's base model, GPT, has evolved from GPT-2 in 2019 to GPT-3 in 2020, then to GPT-3.5 in 2022, and finally to the current GPT-4. The size of each generation of the model and the number of parameters (weights, tokens, and states) have increased by orders of magnitude.


The GPT-2 model contains 1.5 billion parameters, the GPT-3 model contains 175 billion parameters, and the latest GPT-4 model increases the number of parameters to approximately 1.7 trillion (official figures have not yet been released).


The sheer number of these parameters not only pushes memory capacity to terabyte levels, but also pushes the bandwidth needed to access them at high speed during training and inference to hundreds of gigabytes, or even terabytes, per second. Worse still, moving these parameters consumes a significant amount of energy.
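A back-of-envelope calculation of these footprint and bandwidth figures, assuming fp8 inference (1 byte per parameter) and a hypothetical one-second response in which every weight is streamed once:

```python
# Memory footprint and bandwidth estimate for the GPT-scale models
# quoted above. Bytes-per-parameter (fp8) and the one-second target
# are illustrative assumptions.
PARAMS = {"GPT-2": 1.5e9, "GPT-3": 175e9, "GPT-4": 1.7e12}
BYTES_PER_PARAM = 1  # fp8 inference: 1 byte per weight

# Footprint in bytes: GPT-4 at fp8 already needs ~1.7 TB for weights alone.
footprint = {name: n * BYTES_PER_PARAM for name, n in PARAMS.items()}

# If every weight is streamed once per one-second response, the
# required bandwidth (TB/s) equals the footprint read per second.
TARGET_LATENCY_S = 1.0
bandwidth_tb_s = footprint["GPT-4"] / TARGET_LATENCY_S / 1e12
```

Even this optimistic single-pass model lands in terabytes-per-second territory for GPT-4, matching the bandwidth figures cited above.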


Expensive hardware sitting idle


The daunting data-transfer bandwidth and power consumption between memory and processor overwhelm processor efficiency. Recent analyses indicate that running GPT-4 on cutting-edge hardware yields an efficiency of only about 3%: the expensive hardware designed to run these algorithms sits idle 97% of the time.


The lower the execution efficiency, the more hardware is required to perform the same task. For example, suppose two vendors, A and B, each offer a processor with 1 petaflop (1,000 teraflops) of theoretical throughput, but with different processing efficiencies of 5% and 50% respectively (Table 2).


Vendor A therefore delivers only 50 teraflops of effective (rather than theoretical) processing power, while Vendor B delivers 500 teraflops. To provide 1 petaflop of effective compute, Vendor A needs 20 processors, while Vendor B needs only 2.
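The arithmetic behind this comparison can be sketched as follows (a simple model that ignores any overhead beyond the stated efficiency):

```python
import math

def processors_needed(target_tflops: float, peak_tflops: float,
                      efficiency: float) -> int:
    """Processors required to reach a target *effective* throughput,
    given per-processor peak throughput and utilization efficiency."""
    effective = peak_tflops * efficiency  # usable TFLOPS per processor
    return math.ceil(target_tflops / effective)

TARGET = 1000.0  # 1 petaflop of effective compute
vendor_a = processors_needed(TARGET, 1000.0, 0.05)  # 5% efficient -> 20
vendor_b = processors_needed(TARGET, 1000.0, 0.50)  # 50% efficient -> 2
```

At GPT-4's reported ~3% efficiency the same formula would demand 34 such processors, which is why efficiency dominates the hardware bill.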

For example, a Silicon Valley startup plans to use 22,000 Nvidia H100 GPUs in its supercomputer data center. Roughly calculated, 22,000 H100 GPUs cost $800 million, representing the majority of its latest funding round. This figure does not include other infrastructure costs, real estate, energy costs, and all other factors in the enterprise's total cost of ownership (TCO) for hardware.


The impact of system complexity on latency and efficiency


Another example, based on the most advanced GenAI training accelerators currently available, illustrates the concern. The Silicon Valley startup's GPT-4 configuration would require 22,000 Nvidia H100 GPUs deployed in groups of eight in HGX H100 or DGX H100 systems, 2,750 systems in total.


Given that GPT-4 comprises 96 decoders, mapping them across multiple chips may mitigate the impact on latency. Since the GPT architecture allows for sequential processing, assigning one decoder to each chip, for a total of 96 chips, might be a reasonable setup.


This configuration is equivalent to 12 HGX/DGX H100 systems, so latency is affected not only by chip-to-chip data transfers but also by board-to-board and system-to-system transfers. Incremental transformers can significantly reduce processing complexity, but they require computing and storing state, which in turn increases the amount of data to be handled.
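The mapping described above works out as follows (a sketch assuming exactly one decoder per GPU and eight GPUs per HGX/DGX system, as in the text):

```python
import math

# Mapping GPT-4's 96 decoder blocks one-per-GPU onto 8-GPU
# HGX/DGX H100 systems.
DECODERS = 96
GPUS_PER_SYSTEM = 8

gpus_needed = DECODERS  # one decoder per chip
systems_needed = math.ceil(gpus_needed / GPUS_PER_SYSTEM)  # 12 systems
```

Each of the 11 system-to-system hops in this pipeline adds transfer latency that a single-chip design would avoid entirely.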


Most importantly, the 3% efficiency figure quoted above is optimistic. Once the overhead of the system implementation and the associated longer latencies are taken into account, the efficiency achieved in practical deployments drops significantly further.


In the long run, GPT-3.5 requires far less data than GPT-4, so from a business perspective its lower complexity makes it the more attractive choice. On the other hand, GPT-4 is more accurate, and it would become the preferred choice if its hardware challenges could be resolved.


Cost analysis


Let’s focus on the implementation cost of systems capable of handling a large number of queries, such as Google’s 100,000 queries per second.


Using the most advanced hardware available today, we can reasonably assume a total cost of ownership, including acquisition, operation, and maintenance, of approximately $1 trillion. That is roughly half the 2021 GDP of Italy, the world's eighth-largest economy.


This puts ChatGPT's cost per query at commercially challenging levels. Morgan Stanley estimates the 2022 cost per Google search query (3.3 trillion queries that year) at 0.2 cents, which serves as a benchmark. The same analysis puts ChatGPT-3's cost per query at 3 to 14 cents, 15 to 70 times the benchmark.
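The multiple quoted above follows directly from the per-query figures (a simple sketch using the Morgan Stanley estimates):

```python
# Cost-per-query comparison from the Morgan Stanley figures quoted above.
GOOGLE_COST_CENTS = 0.2            # 2022 benchmark per Google search query
CHATGPT_COST_CENTS = (3.0, 14.0)   # estimated range for ChatGPT-3

ratio_low = CHATGPT_COST_CENTS[0] / GOOGLE_COST_CENTS   # 15x benchmark
ratio_high = CHATGPT_COST_CENTS[1] / GOOGLE_COST_CENTS  # 70x benchmark
```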


Looking for chip architecture solutions


The semiconductor industry is frantically searching for solutions to the cost/query challenges. While all attempts are welcome, the solution must come from a novel chip architecture that breaks down the memory wall and significantly reduces power consumption.

