Despite the hopes and dreams many embedded engineers have, highly reliable code is not achieved overnight. It's a painstaking process that requires developers to maintain and manage every bit and byte of the system. There's often a sense of relief when an application is deemed "successful," but just because software works correctly under controlled conditions at that moment doesn't mean it will work correctly tomorrow or a year from now.
There are many techniques for developing highly reliable embedded systems, ranging from well-defined development cycles to rigorous execution and system checks. Here are seven easy-to-implement and long-lasting techniques that are very helpful in ensuring more reliable system operation and capturing abnormal behavior.
Tip 1 – Fill the ROM with known values
Software developers tend to be an optimistic bunch: get the code running faithfully for a long time and the job is done. It seems quite rare for a microcontroller to jump out of application space and start executing from an unexpected region of memory. However, the chance of this happening is no lower than that of a buffer overflow or a stray pointer losing its reference. It does happen! The system's behavior afterward is unpredictable, because erased memory defaults to 0xFF, or, because that region is not normally written, it may hold values nobody knows.
Fortunately, there are several linker and IDE techniques that help detect such events and recover from them. One is to use the linker's FILL command to fill unused ROM with a known bit pattern. Unused memory can be filled in many different ways, but if the goal is a more reliable system, the most obvious choice is to place an ISR (interrupt service routine) fault handler at these locations. If something goes wrong and the processor starts executing outside program space, the ISR is triggered, which provides an opportunity to save the processor registers and system state before deciding on corrective action.
Tip 2 – Check the application's CRC
A significant benefit for embedded engineers is that our IDEs and toolchains can automatically generate application or memory space checksums, which can then be used to verify the application's integrity. Interestingly, in many of these cases, the checksum is only used when the program code is loaded onto the device.
However, if the CRC or checksum is kept in memory, verifying that the application is still intact at startup (or even periodically in long-running systems) is an excellent way to make sure nothing unexpected has happened. The probability of a programmed application changing is very small, but given the billions of microcontrollers shipped every year and the potentially harsh environments they operate in, the chance of an application becoming corrupted is not zero. More likely, a flaw in the system could trigger a flash write or a sector erase that compromises the application's integrity.
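As a minimal sketch of such a startup check, the C code below computes a bitwise CRC-32 over an image and compares it with a stored value. The function names (crc32_compute, app_image_is_intact) are hypothetical, and the sketch assumes the toolchain emits a standard CRC-32 (reflected polynomial 0xEDB88320); the variant your tools actually generate, and where the expected value is stored, must be checked against their documentation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 using the reflected polynomial 0xEDB88320.
 * Slow but small; a table-driven or hardware CRC is a common replacement. */
uint32_t crc32_compute(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical helper: returns true when the image still matches the CRC
 * recorded at build time. The caller decides what to do on a mismatch
 * (for example, stay in the bootloader or enter a safe state). */
bool app_image_is_intact(const uint8_t *image, size_t len, uint32_t expected_crc)
{
    return crc32_compute(image, len) == expected_crc;
}
```

On a real target, image and len would describe the application's flash region and expected_crc would be read from wherever the build tools placed it; the same function can be called again from a low-priority periodic task.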
Tip 3 – Perform a RAM check at startup
To build a more reliable and robust system, it is crucial to make sure the hardware is working correctly. After all, hardware can fail. (Fortunately, software never fails; it only does what the code tells it to do, whether correctly or incorrectly.) Verifying at startup that there are no problems with internal or external RAM is a good way to confirm the hardware is working as expected.
There are many ways to perform a RAM check, but a common approach is to write a known pattern, wait a short time, and then read it back. What is read should match what was written. In most cases the RAM check passes, which is the desired result. However, there is a very small chance that the check fails, and that failure is an excellent opportunity to flag a hardware problem in the system.
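The write-and-read-back idea can be sketched in C as follows. This is an illustration under assumptions, not a complete memory test: the function name ram_pattern_test and the two alternating patterns are hypothetical choices, the "wait a short time" step is omitted, and the test is destructive, so it would have to run before the region holds live data (and never over the stack currently in use).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Writes each pattern across the whole region, then reads it back.
 * Returns false on the first mismatch. The region is left filled with
 * the last pattern, so any previous contents are lost. */
bool ram_pattern_test(volatile uint32_t *start, size_t words)
{
    /* Complementary patterns so every bit is tested as both 1 and 0. */
    static const uint32_t patterns[] = { 0xAAAAAAAAu, 0x55555555u };

    for (size_t p = 0; p < sizeof patterns / sizeof patterns[0]; p++) {
        for (size_t i = 0; i < words; i++) {
            start[i] = patterns[p];
        }
        for (size_t i = 0; i < words; i++) {
            if (start[i] != patterns[p]) {
                return false;
            }
        }
    }
    return true;
}
```

A simple pattern test like this catches stuck bits but not every address-line fault; more thorough schemes exist if the application calls for them.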
Tip 4 – Use a stack monitor
For many embedded developers, the stack seems like a rather mysterious force. When strange things start happening and engineers are finally stumped, they begin to wonder whether something is going on in the stack. The result is blind adjustment of the stack size, location, and so on. Often the error has nothing to do with the stack, but how can anyone be sure? After all, how many engineers have actually performed a worst-case stack size analysis?
The stack size is allocated statically at compile time, but the stack is used dynamically. As code executes, the local variables, return addresses, and other information the application needs are continuously pushed onto and popped off the stack, so the stack grows and shrinks within its allocated memory. Sometimes this growth exceeds the limit set at compile time, and the stack corrupts data in adjacent memory regions.
One way to be absolutely sure the stack is behaving is to implement a stack monitor as part of the system's "health care" code (how many engineers actually do this?). The stack monitor creates a buffer zone between the stack and other memory areas and fills it with a known bit pattern. The monitor then checks the pattern regularly for any change. If the bit pattern changes, the stack has grown too large and is about to push the system into disaster! At this point, the monitor can record the event, the system state, and any other useful data for later diagnosis.
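A guard-zone monitor of this kind might look like the following C sketch. The names (stack_guard_init, stack_guard_intact), the guard size, and the pattern value are all hypothetical; on a real target the guard array would be placed by the linker directly beyond the end of the stack, which is not shown here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_GUARD_WORDS   8u          /* assumed guard size */
#define STACK_GUARD_PATTERN 0xC5C5C5C5u /* assumed fill pattern */

/* Fill the guard zone once at startup, before the stack can reach it. */
void stack_guard_init(volatile uint32_t *guard, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        guard[i] = STACK_GUARD_PATTERN;
    }
}

/* Call periodically from the health-check code. A false result means
 * something, most likely the stack, has written into the guard zone. */
bool stack_guard_intact(const volatile uint32_t *guard, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        if (guard[i] != STACK_GUARD_PATTERN) {
            return false;
        }
    }
    return true;
}
```

When stack_guard_intact returns false, the caller can log the event and system state before choosing a recovery action. Note that a check like this only detects an overflow after the fact and only if the overflow actually touched the guard words.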
Most real-time operating systems (RTOSes) and microcontroller systems that implement a memory protection unit (MPU) provide a stack monitor. Worryingly, these features are either disabled by default or intentionally disabled by developers. A quick online search turns up many people who recommend disabling the stack monitor in an RTOS to save 56 bytes of flash. That is a false economy!
Tip 5 – Use the MPU
In the past, it was difficult to find a memory protection unit (MPU) in a small and inexpensive microcontroller, but that is beginning to change. Now, MPUs are found in microcontrollers from high-end to low-end, and these MPUs provide embedded software developers with an opportunity to significantly improve the robustness of their firmware.
MPUs are increasingly coupled with the operating system to create memory spaces in which processes are isolated from one another, so a task can execute its code without fear of being stomped on. If something does go wrong, the rogue process is terminated and other protective measures are carried out. Keep an eye out for microcontrollers with this component; if you find one, make good use of it.
Tip 6 – Build a robust watchdog system
A common and much-favored watchdog implementation is one where the watchdog is enabled (a good start) but is cleared from a periodic timer whose execution is completely isolated from whatever is happening in the program. The purpose of a watchdog is to ensure that if an error occurs, the watchdog is not cleared; that is, when work stalls, the system is forced through a hardware reset to recover. With a timer that is independent of system activity, the watchdog keeps being cleared even though the system has failed.
Embedded developers need to think carefully about how application tasks are integrated into the watchdog system. For example, one technique is to have each task that runs within a certain period flag that it completed successfully. The watchdog is cleared only if every task has checked in; otherwise it is left to expire and force a reset. More advanced techniques use an external watchdog processor, which can monitor the main processor's behavior and vice versa.
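One way to picture the task check-in technique is the C sketch below, in which the watchdog is kicked only when every monitored task has reported in since the last service call. Everything named here is hypothetical: the three task bits, wdt_task_checkin, wdt_service, and hw_watchdog_kick, which stands in for the device-specific register write and is stubbed with a counter so the sketch is self-contained.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task identifiers, one bit per monitored task. */
#define TASK_SENSOR  (1u << 0)
#define TASK_COMMS   (1u << 1)
#define TASK_CONTROL (1u << 2)
#define ALL_TASKS    (TASK_SENSOR | TASK_COMMS | TASK_CONTROL)

static uint32_t checked_in;   /* bits set by tasks since the last kick */
static unsigned kick_count;   /* stub bookkeeping only */

/* Stand-in for the real hardware access (an assumption, not a real API). */
static void hw_watchdog_kick(void)
{
    kick_count++;
}

/* Each task calls this after it has completed one full cycle of work.
 * With a preemptive RTOS this update would need to be made atomic. */
void wdt_task_checkin(uint32_t task_bit)
{
    checked_in |= task_bit;
}

/* Called periodically by a supervisor. Kicks the watchdog only if every
 * task has checked in; otherwise it does nothing, so the hardware
 * watchdog eventually expires and resets the system. */
bool wdt_service(void)
{
    if ((checked_in & ALL_TASKS) == ALL_TASKS) {
        hw_watchdog_kick();
        checked_in = 0u;
        return true;
    }
    return false;
}
```

The design choice is that a single stalled task is enough to withhold the kick, which is exactly what a free-running timer interrupt cannot provide.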
For a reliable system, establishing a robust watchdog system is crucial. Due to the sheer number of technical aspects involved, it's impossible to cover everything in these few paragraphs, but I will publish further articles on this topic in the future.
Tip 7 – Avoid dynamic memory allocation
Engineers who are not used to working in resource-constrained environments may be tempted to use the features of their programming language that allocate memory dynamically. After all, this is a common technique on desktop computer systems, where memory is allocated only when needed. For example, when developing in C, an engineer might use malloc to allocate space on the heap. An operation is performed, and once it completes, the memory is returned with free so that the heap can be reused.
In resource-constrained systems, this can be a disaster! One problem with dynamic memory allocation is that mistakes or improper technique lead to memory leaks or heap fragmentation. When these problems occur, most embedded systems have neither the resources nor the knowledge to monitor the heap or deal with it properly. And what happens when the application requests space and the requested space is not available?
The problems that come with dynamic memory allocation are complex, and handling them properly can be a nightmare! An alternative is to simplify things and allocate memory statically. For example, instead of requesting a 256-byte buffer via malloc, simply create a 256-byte buffer in the program. That memory is held for the entire lifetime of the application, with no concerns about the heap or memory fragmentation.
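As a small illustration of the static alternative, the C sketch below replaces a malloc'd receive buffer with a fixed 256-byte array. The names (rx_buffer, rx_buffer_store, rx_buffer_data) are hypothetical, and truncating oversized input is just one possible policy; rejecting it outright is equally reasonable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_BUFFER_SIZE 256u

/* Reserved at link time: it exists for the whole life of the program,
 * never fragments the heap, and its cost is visible in the map file. */
static uint8_t rx_buffer[RX_BUFFER_SIZE];

/* Copies at most RX_BUFFER_SIZE bytes into the static buffer and
 * returns the number of bytes actually stored. */
size_t rx_buffer_store(const uint8_t *src, size_t len)
{
    if (len > RX_BUFFER_SIZE) {
        len = RX_BUFFER_SIZE;
    }
    memcpy(rx_buffer, src, len);
    return len;
}

/* Read-only access to the stored data. */
const uint8_t *rx_buffer_data(void)
{
    return rx_buffer;
}
```

Because the buffer size is fixed at build time, running out of memory becomes a link-time error rather than a failed malloc in the field.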
Conclusion
These are just some of the ways developers can start building more reliable embedded systems. There are many other techniques, such as following a good coding standard, detecting bit flips, performing array and pointer bounds checks, and using assertions. All of these techniques help designers develop more reliable embedded systems.