For embedded software, smaller code size is always better. Compressing code to fit storage subsystems constrained by cost or space has become an important aspect of embedded system development. ARM, MIPS, IBM, and ARC all offer techniques for reducing memory footprint, and this article will compare and analyze the implementations of code compression techniques in these architectures.
Today, it's no longer surprising that a storage subsystem can cost more than the microprocessor it serves, so choosing a processor that economizes on storage matters. Writing compact code is only part of the story; the processor's instruction set also has a large effect on memory consumption. On a processor with poor code density, no amount of hand-tightening your C source will save you. If memory consumption worries you, choose the right processor first, then tune your code carefully.
Not all processors have, or need, code compression. It is mainly 32-bit RISC (Reduced Instruction Set Computer) processors that need it, because RISC instruction sets have comparatively low code density. RISC architectures were designed for general-purpose computers and workstations, at a time when memory was considered cheap. Memory may indeed be cheap, but using less of it is cheaper still. In cellular phones and other cost-sensitive embedded systems, a $5 difference in RAM and ROM cost multiplies into a huge difference in profit at production volumes. Typically the memory size is fixed while product features compete for it: more compact object code means more autodial features, better voice recognition, or perhaps a sharper screen display.
In the 32-bit embedded processor market, ARM, MIPS, and PowerPC were among the first to find ways to reduce memory consumption and increase code density. Earlier processors, such as Motorola's 68k series and Intel's x86 series, never needed code compression; in fact, their native code density was higher than that of RISC processors running in compressed mode.
Easy-to-use Thumb technology
We'll start with ARM's code compression scheme, Thumb, because it is widely used, well supported, representative of processor code compression schemes, and reasonably simple and effective.
Thumb is actually a separate instruction set added on top of ARM's standard RISC instruction set. In your code, you can switch between the two instruction sets with a single mode-switching instruction. The Thumb Instruction Set Architecture (ISA) consists of approximately 36 16-bit instructions. These instructions alone cannot accomplish much, but the Thumb set does include basic add, subtract, rotate, and branch instructions. By replacing ARM's standard 32-bit instructions with these shorter ones, the size of some code can be reduced by roughly 20% to 30%. However, some issues need to be considered:
First, Thumb code and standard ARM code cannot be mixed; you must explicitly switch between the two modes as if Thumb were a completely different instruction set (which it actually is). This forces programmers to separate all 16-bit and 32-bit code into independent modules.
Secondly, because Thumb is a cut-down instruction set architecture, not everything you want can be done in Thumb mode. Interrupt handling, long jumps, atomic memory operations, and coprocessor operations are all off-limits. Thumb's limited instruction set makes it useful only for basic arithmetic and logic operations; everything else must be done with ARM's standard 32-bit instruction set.
The limitations of Thumb mode extend beyond the instruction set. In Thumb mode the processor exposes only eight registers (instead of sixteen), and Thumb instructions lack the conditional execution and the inline shift and rotate options available in standard ARM code. Passing parameters between standard ARM and Thumb code is straightforward, though: parameters can go on the stack or in the processor's first eight registers.
Switching back and forth between standard mode and Thumb mode also costs time and code: extra prologue and epilogue instructions are needed to fix up pointers and flush the processor pipeline. If a stretch of Thumb code runs for less than a few dozen instructions, it isn't worth the overhead.
Finally, Thumb also has a slight impact on performance. Typically, compressing code using Thumb instructions results in a roughly 15% decrease in execution speed, primarily due to switching between 16-bit and 32-bit modes. Thumb instructions are also less flexible than standard 32-bit instructions, often requiring more instructions to accomplish the same task compared to 32-bit code. On the positive side, because its instruction length is only half that of a 32-bit instruction set, Thumb allows for more efficient cache utilization.
If your task fits within these constraints, Thumb can save real money. Every ARM processor supports Thumb, and most ARM compilers and assemblers support the Thumb instruction set whether or not you use it, so the Thumb experience should be fairly smooth.
MIPS processor
Once you understand Thumb, MIPS16e holds no surprises. Some MIPS processors add a similar 16-bit instruction set, much as ARM does. MIPS16e consists of 16-bit, cut-down versions of standard MIPS arithmetic, logic, and branch instructions. Usage is also Thumb-like: you must switch between standard and MIPS16e modes, which costs time and code, so unless you can stay in "compressed" mode for a fair stretch, switching isn't worthwhile. Its compression efficiency is similar to ARM's, around 20% to 30% for most programs.
Neither MIPS16e nor Thumb can truly compress code; they only provide alternative opcodes for some instructions, and the compression ratio depends on the ratio of the short opcode to the long opcode. In other words, system-level code, such as the operating system and interrupt handlers, cannot use 16-bit instructions and therefore cannot achieve code compression. General algorithms, as long as they don't use any large operands, can achieve good compression efficiency. Finally, remember that data cannot be compressed; only code can be compressed. If your application code includes a large number of static data structures, the total memory savings are very small. Also, a 15% performance penalty may not be worthwhile. On the other hand, MIPS16e and Thumb are both free (assuming your processor already includes them), making their selection very inexpensive.
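The arithmetic behind these savings is simple enough to sketch. The toy function below (an illustration of the ratio argument above, not vendor data) estimates the image size for a dual-width instruction set given the fraction of instructions that fit the 16-bit encodings:

```python
def size_ratio(short_fraction):
    """Estimated compressed/original size for a dual-width ISA.

    Assumes every instruction starts at 32 bits and a fraction
    `short_fraction` of them can be re-encoded in 16 bits.
    """
    return short_fraction * 0.5 + (1.0 - short_fraction)

# If half the instructions fit the 16-bit subset, the image shrinks by 25%:
print(f"{1 - size_ratio(0.5):.0%}")   # → 25%
# The 20-30% savings quoted above imply 40-60% of instructions go short:
print(f"{1 - size_ratio(0.6):.0%}")   # → 30%
```

This is also why system-level code that can't use the short encodings, or an image dominated by static data, sees little benefit: its `short_fraction` is close to zero.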
PowerPC's CodePack technology
Fair warning: IBM's CodePack approach is the most complex of these code compression technologies. Unlike Thumb and MIPS16e, CodePack genuinely compresses the code, a bit like running WinZip over your PowerPC software. CodePack analyzes and compresses the entire program; the compressed version is what ships, and the processor decompresses it on the fly at run time. Despite the complexity, CodePack delivers the same 20% to 30% space savings as the other techniques.
CodePack is attractive in use. You compile your embedded PowerPC code with standard tools, as usual; CodePack can even work with existing binaries (whether or not you have the source). Before the code is burned into ROM or loaded onto disk, the CodePack compression tool squeezes it, analyzing the statistical distribution of the program's instructions and generating a decompression key specific to that program. When the compressed code runs, a CodePack-equipped processor uses that key to decompress instructions on the fly, as though it were executing normal uncompressed code. Decompression adds a small delay to the processor pipeline, but the effect is hidden by instruction-fetch latency and other delays; for the vast majority of applications, the performance impact of CodePack is negligible.
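As a rough sketch of the idea (a toy dictionary coder, not IBM's actual algorithm or table format), the key can be thought of as a program-specific table mapping frequent instruction words to short codes:

```python
from collections import Counter

def build_key(words):
    # Rank instruction words by frequency; the most common word gets
    # the smallest (cheapest-to-encode) code. A toy stand-in for
    # CodePack's real, program-specific decompression tables.
    ranked = [w for w, _ in Counter(words).most_common()]
    return {w: code for code, w in enumerate(ranked)}

def compress(words, key):
    return [key[w] for w in words]

def decompress(codes, key):
    inverse = {code: w for w, code in key.items()}
    return [inverse[c] for c in codes]

# A tiny "program" of 32-bit words (values are arbitrary, not real opcodes):
program = [0x7C0802A6, 0x38600001, 0x7C0802A6, 0x4E800020]
key = build_key(program)
codes = compress(program, key)
print(codes)                               # → [0, 1, 0, 2]
assert decompress(codes, key) == program   # round-trip needs the matching key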
However, CodePack has other implications. Because each compressed program carries its own unique key, CodePack is in effect an encryption system as well as a compression system. Without the key, neither you nor anyone else can run the program; if the key is lost, the compressed program is so much gibberish. This also means compressed PowerPC programs are not binary compatible: they cannot easily be exchanged with other systems unless the decompression key travels with them, which makes field updates of embedded software slightly more complicated.
In fact, CodePack generates two keys per program, because the high and low 16 bits of each instruction are compressed separately. IBM's engineers found that the high halfword (containing the opcode) and the low halfword (typically holding operands, offsets, or masks) of a PowerPC instruction follow different statistical distributions, and that compressing each half with its own tuned algorithm beats any single algorithm applied to both. So that's what CodePack does.
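IBM's observation can be illustrated with a toy entropy measurement. The instruction words below are made up for the example (not real profiled PowerPC code), but they show the pattern: opcode halfwords repeat heavily while operand halfwords vary, so the opcode half is more compressible and deserves its own coding table:

```python
import math
from collections import Counter

def entropy(symbols):
    # Shannon entropy in bits per symbol: a lower bound on how many
    # bits an ideal coder needs, on average, for this distribution.
    total = len(symbols)
    return -sum(n / total * math.log2(n / total)
                for n in Counter(symbols).values())

# Hypothetical 32-bit instruction words (illustrative, not real encodings):
words = [0x38600001, 0x38600002, 0x38600003, 0x7C0802A6,
         0x38600004, 0x7C0803A6, 0x38600005, 0x38600006]
highs = [w >> 16 for w in words]      # opcode halfwords: few distinct values
lows  = [w & 0xFFFF for w in words]   # operand halfwords: nearly all distinct
print(entropy(highs) < entropy(lows))  # → True
```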
ARCompact
ARC International employed an alternative code compression method. Because the ARCtangent processor had a user-definable instruction set, ARC (and its users) could modify it at will. With ARCompact, ARC decided to add a set of 16-bit instructions to improve the processor's code density.
What sets ARCompact apart from Thumb and MIPS16e is that 16-bit and 32-bit code can be mixed freely. With no mode switching, even a few 16-bit instructions sprinkled through the code cost nothing. By default, the ARC compiler emits 16-bit operations wherever it can (the feature can be turned off to force 32-bit code, or for compatibility with older processors).
ARC can mix instruction lengths without overhead because its instruction set architecture is newer than ARM's or MIPS's. Those older RISC instruction sets (PowerPC's included) reserve no bits in the instruction word to indicate the instruction's length. Newer "pseudo-RISC" architectures such as ARC and Tensilica, like the older x86 and 68k, do carry such length bits. Whether by accident or foresight, variable-length instruction sets win on code density.
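A length-marked instruction stream is easy to walk. This sketch uses an invented convention (the top bit of the first halfword marks a 32-bit instruction; this is not ARC's real encoding) to show how a decoder handles mixed 16- and 32-bit code with no mode state at all:

```python
def decode(stream):
    # Walk a byte stream of mixed-width instructions. Hypothetical rule:
    # top bit of the first halfword set → 32-bit instruction, else 16-bit.
    pc, lengths = 0, []
    while pc < len(stream):
        first = int.from_bytes(stream[pc:pc + 2], "big")
        size = 4 if first & 0x8000 else 2
        lengths.append(size)
        pc += size
    return lengths

# Two short instructions, one long, one short — no mode switches anywhere:
code = bytes([0x10, 0x00, 0x20, 0x01, 0x80, 0x00, 0x00, 0x02, 0x30, 0x03])
print(decode(code))  # → [2, 2, 4, 2]
```

Because every instruction announces its own width, the fetch unit never needs the separate "which mode am I in?" state that Thumb and MIPS16e require.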
Thumb-2: an improved version of Thumb
ARM revamped its code compression system and released Thumb-2. Thumb-2 is not an upgrade to Thumb; rather, it's a completely new system that can fully replace Thumb and the original ARM instruction set. Thumb-2 is somewhat similar to ARCompact or Motorola's 68k, capable of running mixed 16-bit and 32-bit code without mode switching. Overall, Thumb-2 offers slightly lower code compression efficiency, but with less performance loss.
To pull this off, ARM had to find a hole in its opcode map, and found one in the BL (branch-and-link) instruction, used for subroutine calls and, in its BLX form, for switching between Thumb and ARM modes. Some bits of the BL encoding were unused in the original instruction set, and these previously undefined bits provided the entry point into the entirely new instruction set. The encoding isn't pretty, but it undeniably works.
Thumb-2's biggest advantage is that it is a complete instruction set, so programs never need to switch back to "standard" 32-bit ARM mode, and the old Thumb restrictions are gone. Thumb-2 programs can handle interrupts, program the MMU, and manage caches; they can do everything a real microprocessor needs to do.
Thumb-2 still incurs some performance loss. Despite eliminating mode-switching overhead, it still requires more Thumb-2 instructions to complete specific tasks compared to standard ARM code. For ARM processors, these additional instructions (and extra cycles) can reduce speed by approximately 15% to 20%.
Future ARM processors may eventually run only Thumb-2 code. Since it effectively replaces both the ARM and Thumb instruction sets with a single, denser one, why not retire the others entirely? But that raises the question of ARM's software compatibility. To date, every ARM processor (except Intel's XScale) has been binary compatible. New processors supporting Thumb-2 will run existing ARM and Thumb code, but the reverse is not true. Once Thumb-2 takes hold, it will create a separate-but-equal software base.