Abstract: Embedded Linux systems are characterized by the use of a bootloader in place of a desktop BIOS and by a scaled-down system image. However, hardware limitations often lead to slow boot times, and users of embedded products are particularly sensitive to boot speed, which creates strong demand for faster-booting embedded Linux systems. This paper discusses the stages of operation performed during system startup and methods to shorten the time each stage takes.
1. Embedded Linux System Boot Sequence
Currently, embedded systems vary greatly in hardware platforms and application directions, but their overall startup process remains consistent. System startup here refers to the process from when the user performs a power-on/reset operation until the system begins to provide a user-acceptable level of service. Typical power-on/reset timings are listed in Table 1.
Table 1. Embedded Linux System Boot Sequence
2. Linux Quick Start Method
Currently, some Linux distributions have optimized boot speed. If developing using standard Linux, boot speed improvements are primarily achieved through kernel configuration and various patch packages. The following analyzes some key technologies for fast boot.
2.1 Firmware and Bootloader Stage
Once the target board is determined, the firmware run time cannot be changed, and the read/write speeds of flash and RAM are likewise fixed. However, if the firmware and bootloader can be bypassed on reset, that is, if a running kernel is allowed to load and boot another kernel, the boot time can be shortened. A typical implementation is kexec, which has two components: the user-space kexec-tools package and the kernel-side support. Another method is to add the reboot=soft parameter to the kernel command line, which also bypasses the firmware on reboot; its drawback is that it cannot be invoked from user space.
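For illustration, a warm reboot with kexec-tools might look like the following sketch. The paths, console settings, and root device are assumptions; the commands require root and a kernel built with kexec support:

```shell
# Stage the new kernel image in memory (paths/arguments are examples)
kexec -l /boot/zImage \
      --append="console=ttyS0,115200 root=/dev/mtdblock2 quiet" \
      --initrd=/boot/initrd.img
# Jump straight into the staged kernel, skipping firmware and bootloader
kexec -e
```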
For normal boot, a faster bootloader can be selected and the kernel miniaturized; high-speed image-copying techniques (such as DMA2RAM) can also be used to shorten copying time. To reduce decompression time, more efficient compression algorithms can be sought. In general, however, the higher the compression ratio, the more complex the algorithm and the slower the decompression, so there is a trade-off between copying time (which falls as the compression ratio rises) and decompression time (which generally rises with it).
2.2 Kernel Stage
During kernel initialization, the Real-Time Clock (RTC) is synchronized. This takes about 1 second and can be omitted to save time, but the system clock will then deviate from the correct time by up to 1 second. If the system clock is written back to the RTC at shutdown, the deviation accumulates across reboots. For systems that synchronize against an external clock source, however, this stage can be safely skipped.
A preset LPJ value can be used to avoid the call to calibrate_delay(), which calibrates loops_per_jiffy at every startup. This overhead is independent of CPU frequency and typically takes about 300 ms in an embedded hardware environment. For a fixed hardware platform the LPJ value is constant, so it needs to be measured only once; on subsequent boots it can be forced via a boot parameter, skipping the calibration entirely. Specifically, after a normal boot, record the value printed in the kernel's "Calibrating delay loop" message and pass it on the kernel command line in the form "lpj=xxxxxx".
The boot process opens the console by default to output boot messages, but the console, especially a framebuffer-based console, slows down the boot process. Embedded Linux products therefore silence the console during boot by adding "quiet" to the kernel boot parameters.
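Both optimizations end up on the kernel command line. As a sketch, the lpj value below is purely illustrative and must be replaced with the one printed by your own hardware on a normal calibrating boot:

```shell
# After one normal (calibrating) boot, find the measured value:
dmesg | grep -i 'calibrating delay'
# e.g. "Calibrating delay loop... 299.00 BogoMIPS (lpj=1495040)"
# Then pass it, together with "quiet", in the boot arguments,
# e.g. from a U-Boot prompt (bootloader-specific; values illustrative):
#   setenv bootargs 'console=ttyS0 root=/dev/mtdblock2 quiet lpj=1495040'
```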
Device search and driver installation are time-consuming operations. Therefore, it's crucial to determine which driver modules need to be installed during kernel compilation to prevent the system from searching for non-existent devices, especially redundant IDE devices. For devices that don't require installation at startup, compile the drivers as modules and load them later when the device is idle or in use, rather than placing them all during the startup phase.
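One common pattern, sketched below with hypothetical module names, is to build non-critical drivers as modules (CONFIG_*=m at kernel configuration time) and load them from a late init script in the background:

```shell
# Load non-critical drivers after the main boot path has finished.
# Module names are examples; substitute the drivers deferred on
# your platform.
(
    sleep 5               # let the user-visible boot complete first
    modprobe usb-storage
    modprobe snd-soc-core
) &
```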
2.3 User Space Stage
Traditional Linux initialization scripts are executed by a shell: after the kernel boots, it starts the init process (/sbin/init), which reads an ASCII configuration file (/etc/inittab) to determine the runlevel and then invokes the rc scripts. The rc scripts scan a directory such as /etc/rc.d/rc5.d/ and start the system services that its symbolic links point to.
Linux systems for consumer electronics require essential services such as a graphical interface. Unoptimized systems will default to starting many unnecessary or currently unused system services during this process, consuming significant time. The simplest optimization method is to customize system services by rewriting service configuration files according to actual needs. Furthermore, the execution of the init script is sequential, which can lead to a very slow boot process when the script is large. Therefore, running various services in parallel can be considered to speed up the startup process. Several initialization programs have emerged to replace the init process; initng and upstart are introduced below.
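For a SysV-style layout, trimming services can be as simple as removing the start links, as in this sketch (the service name is hypothetical):

```shell
# Runlevel 5 starts every S##name link in this directory, in order:
ls /etc/rc.d/rc5.d/
# Removing (or renaming S -> K) a link stops that service from starting:
rm /etc/rc.d/rc5.d/S24pcmcia
```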
`initng` (init next generation) starts services in parallel, completing initialization quickly. `initng` assumes that any service whose dependencies are satisfied can start. While one script is loading from external storage or waiting for hardware, another can run and start other services, achieving a better balance between CPU and I/O load. As a dependency-based solution, `initng` uses its own set of initialization scripts that encode the dependencies between services and daemons. If a service depends on other services (declared with the `need` keyword), `initng` ensures that all of its dependencies are available before it starts. Services without dependencies start immediately in parallel, while dependent services wait until they can start safely.
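An initng service file encoding such a dependency might look roughly like the following (the syntax is approximate and the service names are examples; consult the initng documentation for the exact format):

```
daemon daemon/sshd {
    # start only after the root fs check and loopback are up
    need = system/checkroot net/lo;
    exec daemon = /usr/sbin/sshd -D;
}
```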
Upstart differs from initng in being event-based: a job is started or stopped when the events it is waiting for occur. Upstart offers flexible event definitions, distinguishing three types: edge (simple) events, level (value) events, and temporary events. Each event is described by an entry consisting of start/stop, the event name, and an optional expected value. Event dependencies can be handled in two ways: a task can itself emit an event, regardless of when the task starts or ends, which works well for basic tasks executed at startup; for more complex dependencies, the job's shell-script sections can be used.
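A minimal classic Upstart job illustrating the start/stop-on-event form might look like this (the service name is hypothetical; "startup" and "shutdown" are events emitted by init itself):

```
# /etc/init/myservice.conf (illustrative)
start on startup
stop on shutdown
exec /usr/sbin/myservice
```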
2.4 Prefetching and Prelinking
Readahead loading preloads files (program and library files) into a RAM cache before they are used, avoiding I/O access to read them during operation. Knowing which files will be accessed in the next operation allows preloading them (in whole or in part) into the buffer, speeding up execution. In embedded systems, the next operation is often predictable; for example, the system always accesses the same executable/data files in the same order during startup, file block accesses are often sequential, and applications always access the same program file segments, shared libraries, resources, or input files during startup. Readahead loading provides a highly targeted approach, thus improving program execution speed.
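A minimal user-space sketch of this idea, assuming the list of boot-time files is known in advance: real implementations use the readahead(2) system call or posix_fadvise(), but simply reading the files and discarding the data also pulls them into the page cache:

```shell
# Warm the page cache for files the boot sequence will need soon.
# The caller supplies the (platform-specific) file list.
prefetch() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            cat "$f" > /dev/null    # read and discard: data stays cached
        fi
    done
}

# Usage (file names are examples):
# prefetch /usr/bin/app /usr/lib/libgui.so
```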
ELF (Executable and Linkable Format) is the standard binary format on Linux. Starting an ELF program requires the following steps: mapping shared libraries into the virtual address space; resolving symbolic references; and initializing each ELF file. Because shared libraries are position-independent, some relocation and symbol-lookup work must be completed at runtime before jumping to the program's entry point. This flexibility therefore comes at the cost of slow ELF startup; resolving symbolic references in particular consumes significant time, especially for large programs using many shared libraries. In many embedded systems, however, the executable and its shared libraries rarely change, and the linking process is exactly the same every time the program runs.
Prelink leverages this by modifying ELF shared libraries and binaries, adding linking information to the executable to simplify dynamic linking relocations and thus speed up program startup. Prelink first gathers the ELF binaries to be prelinked and their dependent shared libraries, assigning a unique virtual space location to each library and relinking the shared library to this base location (when the dynamic linker needs to load the library, it will map the library to the specified location as long as the virtual space address is not occupied). Then, prelink resolves all relocations in the binary or library, stores the relocation information in the ELF object, and adds a list and checksum of all dependent libraries to the binary or library. For binary files, it also lists all conflicts (disagreements in symbol resolution within the natural search range of the shared library). At runtime, the dynamic linker first checks if all dependent libraries have been mapped to the specified location and if the library files have not changed; it only considers conflicts without processing relocations for each library, greatly improving program startup speed. It's important to note that if a shared library changes, all programs using it must be relinked; otherwise, the program will still need to undergo time-consuming normal relocations.
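In practice this is done with the prelink tool, typically run once against the finished root filesystem (a sketch; the command requires root and a prelink configuration listing the search paths):

```shell
# Prelink every ELF binary and library found via the standard search
# paths and /etc/prelink.conf:
prelink -avR          # -a all, -v verbose, -R randomize base addresses
# If a shared library later changes, rerun prelink; to undo it instead:
prelink -ua
```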
3. XIP and file system optimization
3.1 Code Execution Method
There are three main ways to execute code in an embedded system.
① Fully shadowed. When an embedded system program runs, all the code is copied from non-volatile memory (Flash, ROM, etc.) to RAM for execution.
② Demand paging. Only a portion of the code is copied to RAM. This method manages the import/export of pages in RAM. If an access occurs in virtual memory but not in physical RAM, a page fault will occur, and only then will the code and data be mapped to RAM.
③ eXecute In Place (XIP). During system startup, code is not copied to RAM; instead, it is executed directly in non-volatile memory. RAM only stores the constantly changing data, as shown in Figure 1. If the read speed of the non-volatile memory is similar to that of RAM, XIP can save time on copying and decompression. NOR Flash and ROM have relatively fast read speeds (approximately 100 ns), making them suitable for XIP; however, NAND Flash read operations are sector-based and relatively slow (in μs), therefore, XIP is not recommended.
Figure 1. Comparison of Full Mapping and XIP
XIP can be divided into the following two types:
① Kernel XIP. Running the kernel directly in Flash/ROM saves time spent copying and image decompression. The Linux 2.6.10 kernel already includes XIP support.
② Application XIP. This executes directly from the application code's storage location without loading it into RAM, resulting in faster initial execution. To use application XIP, a file system that supports it should be used.
3.2 XIP File System
Currently, there are two main implementations of the XIP file system: Linear XIP CRAMFS and Advanced XIP File System (AXFS).
CRAMFS is a compressed read-only file system originally designed for booting desktop Linux systems. A modified CRAMFS, however, can support embedded systems and XIP. Linear XIP CRAMFS uses the sticky bit to mark each file it manages as either compressed (demand-paged) or uncompressed (XIP). If a file is marked as XIP, all of its pages are stored uncompressed and contiguously in flash, and when the file is loaded, its page addresses are mapped directly; for demand-paged files, the corresponding pages are decompressed into RAM when a page fault occurs.
To create a Linear XIP CRAMFS filesystem image, you must first determine how frequently each executable and library file is used: frequently used files are good candidates for XIP, while the rest should be compressed. Several tools (such as RAMUST and CFSST) can help decide which files need XIP and which do not. The XIP files can then be marked and the root filesystem created, using the mkfs.cramfs tool as an example:
chmod +t filenames                 # sticky bit marks the files for XIP
mkfs.cramfs -x rootfs rootfs.bin   # -x: build the image with XIP support
Additionally, kernel configuration parameters need to be modified to support XIP: Add `rootfstype=cramfs` to the default kernel command string in the boot options, select kernel XIP, and set the XIP kernel physical address; add MTD support for XIP in the driver; and add support for Linear XIP CRAMFS in the file system. Then you can generate the XIP image.
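Collected as a kernel configuration fragment, these settings might look like the following sketch (option names follow the XIP and linear-CRAMFS patches and vary between kernel versions; the physical address is an example):

```
CONFIG_XIP_KERNEL=y
CONFIG_XIP_PHYS_ADDR=0x00080000   # example flash address of the kernel
CONFIG_MTD_XIP=y
CONFIG_CRAMFS_LINEAR=y
CONFIG_CRAMFS_LINEAR_XIP=y
CONFIG_CMDLINE="console=ttyS0 rootfstype=cramfs"
```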
One drawback of Linear XIP CRAMFS is that it's file-based; all pages in a file must either be XIP-based or compressed/on-demand paging, but in reality, different pages within the same file can be used very differently. AXFS, developed by Intel, is a new read-only file system that inherits many methods from Linear XIP CRAMFS while also making some improvements. AXFS's XIP granularity is page-based, and it includes built-in tools to determine which pages need XIP and which need compression, thus achieving a better balance between speed and RAM/Flash usage.
3.3 Non-XIP File System
XIP is generally based on NOR Flash, which is relatively expensive. For applications with large amounts of user data, NAND Flash-based non-XIP file systems are often used, such as JFFS2 and YAFFS.
JFFS2 is a compression-based file system. In multimedia applications, if images, audio, and video are already compressed, using JFFS2 will undoubtedly place a double burden on the CPU for compression/decompression, affecting access speed. Therefore, in such application-intensive scenarios, using an uncompressed file system (such as YAFFS/YAFFS2) can speed up the system.
YAFFS/YAFFS2 is a journaling file system designed specifically for embedded systems using NAND Flash. Compared to JFFS2, it omits some features (such as lack of support for data compression), resulting in faster speed, shorter mount time, and lower memory consumption. YAFFS/YAFFS2 comes with its own NAND chip driver, allowing users to directly manipulate the file system without using MTD and VFS. The main difference between YAFFS and YAFFS2 is that the former only supports small-page (512-byte) NAND Flash, while the latter supports large-page (2 KB) NAND Flash and offers improvements in memory usage, garbage collection, and access speed.
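On a typical setup, mounting such a NAND partition looks like this sketch (the partition number and mount point are examples; the kernel must be built with YAFFS2 support):

```shell
# Mount a YAFFS2-formatted NAND partition via its MTD block device:
mount -t yaffs2 /dev/mtdblock3 /mnt/data
```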
Conclusion
Fast boot is one of the most pressing requirements for embedded Linux systems. This article analyzed the boot process and the main sources of latency in embedded systems, proposed corresponding solutions, and introduced XIP file systems. Because boot speed depends heavily on the hardware platform, and some methods are mutually exclusive, the techniques must be weighed and combined to suit each practical application.