Many cars on the road today, and even more new cars in showrooms, are equipped with advanced driver assistance systems (ADAS) based on different sensors such as cameras, radar, ultrasonic sensors, or lidar.
The number of these systems will continue to grow as new legislation takes effect; in the United States, for example, rearview cameras are mandated on new vehicles. In addition, factors such as car insurance discounts and vehicle safety ratings from organizations like the National Highway Traffic Safety Administration (NHTSA) and the European New Car Assessment Programme (Euro NCAP) are effectively making certain systems standard equipment, while also fueling consumer demand for them.
Autonomous vehicle functions such as automatic parking, adaptive cruise control, and automatic emergency braking also rely heavily on sensors. It is not just the number or type of sensors that matters, but how they are used. Currently, most ADAS systems on the road operate independently, exchanging very little information with each other. (Some high-end vehicles do offer advanced autonomous driving features, but these are not yet widespread.) Rearview cameras, surround-view systems, radar, and front cameras each serve their own purpose. Adding these independent systems to a vehicle provides the driver with more information and enables basic autonomous functions. However, the limits can be pushed much further (see Figure 1).
Figure 1: ADAS functions exist as separate, independent systems within a vehicle.
Sensor fusion
Simply using more of the same type of sensor cannot overcome the shortcomings of each sensor type; instead, information from different types of sensors must be combined. Camera CMOS chips operating in the visible spectrum struggle in dense fog, rain, sun glare, and low light. Radar, on the other hand, lacks the high resolution of today's imaging sensors. Similar trade-offs exist for every type of sensor.
The brilliance of sensor fusion lies in taking input from different sensors and sensor types and using the combined information to perceive the surrounding environment more accurately, allowing better and safer decisions than any standalone system could make. Radar may not have the resolution of optical sensors, but it has significant advantages in ranging and in penetrating rain, snow, and dense fog. These weather and poor-lighting conditions hinder cameras, but cameras can distinguish colors (think of street signs and road markings) and offer high resolution. Most automotive image sensors deployed today have a resolution of about 1 megapixel; over the next few years, the trend is toward 2-megapixel and even 4-megapixel sensors.
Radar and cameras are a prime example of two sensor technologies that fuse well and complement each other. The functionality of a fusion system built this way far exceeds the sum of what the individual systems can do. Using different types of sensors also provides redundancy when a sensing modality fails, whether from natural causes (such as dense fog) or from interference (electronic or human, directed at cameras or radar). Even with a single sensor failed, such a fusion system can maintain basic or emergency functionality. For a system limited to warning functions, or one in which the driver is prepared to take over control, a failure is less severe. Highly automated and fully automated driving functions, however, must give the driver sufficient time to regain control, and during that handover period the control system must maintain a minimum level of control over the vehicle.
Example of a sensor fusion system
Sensor fusion varies in complexity and data type. Two basic examples of sensor fusion are: a) a rear-view camera plus ultrasonic ranging; b) a front-view camera plus multi-mode front radar—see Figure 2. We can now implement this by making minor modifications to existing systems and/or by adding a separate sensor fusion control unit.
Figure 2: The front radar and the front camera are integrated to achieve adaptive cruise control and lane keeping assist, or the rear-view camera is combined with the ultrasonic ranging alarm to achieve automatic parking.
• Rearview camera + ultrasonic ranging
Ultrasonic parking assist is a widely accepted, mature technology in the automotive market, providing audible or visual warnings of nearby objects while parking. As mentioned earlier, since 2018 all new vehicles sold in the United States have been required to have rearview cameras. Combining information from both sources enables advanced parking assist functionality that neither system can achieve alone. The rearview camera lets the driver clearly see what is behind the vehicle, while machine vision algorithms detect objects, curbs, and street markings. Ultrasonic sensors complement this by accurately measuring the distance to identified objects and by providing basic proximity warnings even in low light or complete darkness.
• Front-view camera + multi-mode front radar
Another powerful combination is a forward-facing camera integrated with forward-facing radar. Radar can measure the speed and distance of objects up to 150 meters away in virtually any weather, while the camera excels at detecting and classifying objects, including reading street signs and road markings. By using multiple camera sensors with different fields of view (FoV) and optics, the system can recognize pedestrians and bicycles crossing in front of the vehicle as well as objects 150 meters away or more, and it can reliably implement functions such as automatic emergency braking and stop-and-go adaptive cruise control.
In many cases, under specific, known external conditions, ADAS functions can be performed with a single sensor or system. Given the many unpredictable situations on the road, however, that is not enough for reliable operation. Besides enabling more complex and autonomous functions, sensor fusion can also reduce false alarms and missed detections in existing functions. That will be crucial for convincing consumers and legislators that a car can be driven autonomously by "a machine."
Sensor fusion system segmentation
Unlike individual systems within a car that perform their own alarm or control functions, in a fusion system, the final action is determined centrally by a single device. The key question now is where data processing takes place and how sensor data is sent to the central electronic control unit (ECU). When fusing multiple sensors that are not centralized but distributed throughout the vehicle, we need to specifically consider the connections and cabling between the sensors and the central fusion ECU. The same applies to the location of data processing, as it also affects the overall system implementation. Let's examine two extreme cases of possible system partitioning.
Centralized processing
In the extreme case of centralized processing, all data processing and decision-making are done in the same location, with the data being "raw data" from different sensors—see Figure 3.
Figure 3: Centralized processing with a “traditional” satellite sensor module.
Advantages:
Sensor modules – small, inexpensive, and low-power, since they only perform detection and transmit data. They are flexible in placement, require minimal installation space, and are cheap to replace. Because they do no processing or decision-making, sensor modules typically carry lower functional safety requirements.
Central processing ECU – has access to all the data, since nothing is lost to preprocessing or compression in the sensor modules. And because the sensors are relatively inexpensive and small, more of them can be deployed.
Disadvantages:
Sensor modules – transmitting raw sensor data in real time requires high-bandwidth links (up to several Gb/s), which can produce high levels of electromagnetic interference (EMI).
Central processing ECU – needs high processing power and speed to handle all the input data, which for many high-bandwidth I/Os and high-end application processors means higher power demands and more heat to dissipate. Each added sensor significantly increases the performance requirements of the central ECU. Some of these drawbacks can be mitigated by interfaces such as FPD-Link III, which carry multiple data types (sensor data, power, control, and configuration) over a single coaxial cable with a bidirectional back channel, greatly reducing the system's wiring requirements.
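A quick back-of-the-envelope calculation shows why the links from "traditional" camera modules to a centralized ECU must run at Gb/s rates. The Python sketch below uses illustrative sensor parameters (a roughly 2-megapixel camera at 1920x1080, 12-bit raw output, 30 fps), not figures from any specific datasheet:

```python
# Estimate the raw (uncompressed) data rate one camera sends to a
# central fusion ECU. All parameters are illustrative assumptions.

def raw_camera_bitrate_gbps(width, height, bits_per_pixel, fps):
    """Uncompressed video bit rate in Gb/s (before link overhead)."""
    return width * height * bits_per_pixel * fps / 1e9

# A ~2-megapixel automotive camera: 1920x1080, 12-bit raw, 30 fps.
rate = raw_camera_bitrate_gbps(1920, 1080, 12, 30)
print(f"{rate:.2f} Gb/s per camera")  # → 0.75 Gb/s per camera
```

With several such cameras plus radar, the aggregate quickly reaches multiple Gb/s, which is exactly the regime serializer/deserializer links like FPD-Link III target.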
Fully Distributed System
At the other extreme is the fully distributed system. Here, advanced data processing, and to some extent decision-making, is performed locally in the sensor modules. Only object data or metadata (data describing object characteristics and/or identifying objects) is sent back to the central fusion ECU, which combines the data and makes the final decision on how to act or react (see Figure 4).
Figure 4: A distributed system in which sensor data is processed by sensor modules and decisions are made by the central ECU.
Fully distributed systems have both advantages and disadvantages.
Advantages:
Sensor modules – a lower-bandwidth, simpler, and cheaper interface can be used between the sensor modules and the central ECU; in many cases a CAN bus running below 1 Mb/s is sufficient.
Central processing ECU – only fuses object data, so it needs far less processing power; for some systems a single advanced safety microcontroller suffices. The smaller module also draws less power. Since most of the processing happens in the sensors, adding sensors does not significantly increase the performance demands on the central ECU.
Disadvantages:
Sensor modules – need an application processor, which makes them larger, more expensive, and more power-hungry. Because they process data and make decisions locally, they also carry higher functional safety requirements. And naturally, adding more sensors raises cost significantly.
Central processing ECU – the central decision-making ECU receives only object data and has no access to the raw sensor data, making it difficult to "zoom in" on a region of interest.
Finding the optimal balance
Depending on the number and types of sensors in the system, and on the scalability requirements across vehicle models and upgrade options, combining the two topologies can yield an optimized solution. Many current fusion systems use sensors with local processing for radar, lidar, and the front machine-vision camera, combining these existing sensor modules with an object-data fusion ECU, while "traditional" sensor modules are used where the driver views the output, as in surround-view and rearview camera systems (see Figure 5). More ADAS functions, such as driver monitoring or camera monitor systems, can be folded into the fusion system, but the principle of sensor fusion remains the same.
Figure 5: Finding the perfect combination of distributed and centralized processing
Platform management, target vehicle segments, flexibility, and scalability are important economic factors, and they too play a significant role in how a fusion system is partitioned and designed. The resulting system may not be optimal for any one specific case, but from the perspective of the platform and the whole fleet it may be the best option.
Who is the “viewer” of all this sensor data?
Regarding ADAS, there are two aspects we have not yet discussed: informational ADAS versus functional ADAS. The former expands and extends the driver's senses (e.g., surround view and night vision), with the driver retaining complete control of the car. The latter is machine vision, which lets the car perceive its surroundings and make and execute decisions (automatic emergency braking, lane keeping assist). Sensor fusion naturally merges these two worlds into one.
This is why the same sensor can serve different purposes, at the cost of constraints on where communication and processing between modules are best located. Take surround view as an example. The feature was originally designed to give the driver a 360-degree view by sending video to a central display. Why not use the same cameras and apply machine vision to them? The rearview camera can then also support reversing protection and automatic parking, while the side-view cameras support blind-spot detection and warnings as well as automatic parking.
A standalone machine vision system processes data locally in the sensor module and transmits object data, or even commands, over a simple, low-bandwidth connection such as a CAN bus. Such a connection, however, cannot carry a complete video stream. Video compression can certainly reduce the required bandwidth, but not by enough for such low-bandwidth links, and it brings its own problems; the challenge only grows with increasing resolution, frame rates, and the number of high dynamic range (HDR) exposures. A high-bandwidth connection with no processing in the camera module solves the video-transport problem, but the machine-vision processing must then be added to the central ECU, where insufficient processing power or inadequate thermal management can become the bottleneck.
While it is technically possible to perform processing within the sensor module and simultaneously use high-bandwidth communication, it is not particularly advantageous from the perspective of overall system cost, power consumption, and installation space.
Reliable operation of sensor fusion configuration
Because many fusion systems can autonomously control vehicle functions (such as steering, braking, and acceleration) without the driver, functional safety must be carefully considered to ensure safe, reliable operation across varying conditions and over the vehicle's lifetime. The moment a system makes decisions and then acts on them autonomously, its functional safety requirements rise significantly.
With a distributed approach, every module that processes critical data or makes decisions must meet those higher standards, which adds bill-of-materials (BOM) cost, size, power consumption, and software compared with modules that merely collect and transmit sensor data. In locations where installation space is tight, devices are hard to cool, and the risk of damage and replacement is high (a minor fender-bender can mean replacing the bumper and every sensor attached to it), this may negate the advantages of a distributed system with many smart sensor modules.
A "traditional" sensor module still needs self-test and fault reporting to keep the overall system safe, but not to the level required of an intelligent sensor module.
While a pure driver-information system can simply shut down and notify the driver when its function is impaired, a highly autonomous driving system does not have that freedom. Imagine a car performing an emergency stop and then suddenly releasing the brakes. Or imagine a car on the highway whose entire system shuts down while the driver is sound asleep in "fully autonomous" mode (a plausible future scenario). The system must remain operational for some period, at least a few seconds to half a minute, until the driver can safely take control. There is not yet clear industry consensus on how much functionality must survive a malfunction and how to guarantee it. Aircraft with autopilot capability typically rely on redundant systems; we generally consider them safe, but they are expensive and bulky.
Sensor fusion will be a key step on the road to autonomous driving, and to enjoying the journey and the pleasure of driving.
Multi-sensor information fusion algorithm
What defines an intelligent vehicle is its intelligence: the car perceives the road environment through its onboard sensor systems, automatically plans its route, and controls the vehicle to reach the intended destination. Today's onboard perception suite includes the visual perception module, millimeter-wave radar, ultrasonic radar, and the 360° surround-view system. Working together, these sensors identify lane lines, pedestrians, vehicles, and other obstacles to ensure safe driving. This is why the perceived information must be fused and made complementary.
This introduces an important concept: information fusion. Different sensors detect different operating conditions and targets. Millimeter-wave radar, for example, mainly identifies obstacles at medium to long range (0.5 m to 150 m), such as vehicles, pedestrians, and roadblocks, while ultrasonic radar mainly identifies obstacles close to the vehicle (0.2 m to 5 m), such as curbs during parking, stationary vehicles ahead and behind, and passing pedestrians. The two sensors work together, each covering the other's weaknesses; by fusing the angle, distance, and speed of obstacles, they characterize the vehicle's surroundings and its drivable space.
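As a minimal sketch of how such complementary measurements might be combined, the Python example below converts a polar detection (range, bearing) to vehicle-frame coordinates and fuses two range estimates by inverse-variance weighting, a standard textbook scheme. All sensor readings, variances, and the bearing value are made-up illustration data:

```python
import math

def polar_to_xy(r, angle_deg):
    """Convert a (range, bearing) detection to vehicle-frame x/y."""
    a = math.radians(angle_deg)
    return r * math.cos(a), r * math.sin(a)

def fuse(est_a, var_a, est_b, var_b):
    """Inverse-variance weighted fusion of two scalar estimates:
    the more certain sensor gets the larger weight."""
    w_a, w_b = 1.0 / var_a, 1.0 / var_b
    return (w_a * est_a + w_b * est_b) / (w_a + w_b)

# Hypothetical detections of the same obstacle during parking:
# radar reports the tighter range variance, ultrasonic the looser one.
radar_range, radar_var = 2.45, 0.01
ultra_range, ultra_var = 2.60, 0.04
bearing_deg = 10.0  # bearing of the obstacle, degrees

fused_range = fuse(radar_range, radar_var, ultra_range, ultra_var)
x, y = polar_to_xy(fused_range, bearing_deg)
print(f"fused range {fused_range:.2f} m at ({x:.2f}, {y:.2f}) m")
```

Note how the fused range (2.48 m) sits closer to the radar's 2.45 m than to the ultrasonic sensor's 2.60 m, because the radar's smaller variance earns it a larger weight.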
Figure 6: Intelligent Vehicle Sensing Module
Information fusion was initially called data fusion; it originated in a sonar signal processing system developed with U.S. Department of Defense funding in 1973. In the 1990s, as information technology spread, the broader concept of "information fusion" was proposed, and multi-sensor data fusion (MSDF) technology emerged.
The main advantage of data fusion is that it fully exploits multi-sensor data resources across time and space: observation data from multiple sensors is acquired in time sequence, then analyzed, synthesized, managed, and used under defined criteria. The result is a consistent interpretation and description of the measured object, supporting the corresponding decision-making and estimation and giving the system more comprehensive information than any of its components alone.
In general, multi-source sensor data fusion follows six steps, as shown in the figure below. First, the multi-source sensing system is built and calibrated. Data is then collected and converted to digital signals, after which preprocessing and feature extraction are performed. A fusion algorithm then computes and analyzes the results, and finally stable, more complete, and consistent target feature information is output.
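The steps above can be sketched as a minimal pipeline. Everything in this Python sketch is illustrative: the sensor names, the readings, and the trivial averaging "fusion algorithm" are stand-ins for the real stages, not any production design:

```python
def acquire(sensors):
    """Collect one raw sample from each sensor (acquisition step)."""
    return {name: read() for name, read in sensors.items()}

def preprocess(raw):
    """Preprocessing step: drop invalid (None) samples."""
    return {k: v for k, v in raw.items() if v is not None}

def extract_features(clean):
    """Feature extraction step: here each feature is just a range in m."""
    return list(clean.values())

def fuse(features):
    """Fusion step: a naive average stands in for a real algorithm."""
    return sum(features) / len(features)

# Hypothetical sensor suite reporting range to the same obstacle.
sensors = {
    "radar":      lambda: 10.2,
    "camera":     lambda: 10.6,
    "ultrasonic": lambda: None,  # obstacle out of range: no valid reading
}

estimate = fuse(extract_features(preprocess(acquire(sensors))))
print(f"fused range estimate: {estimate:.1f} m")  # → 10.4 m
```

The point is the data flow, raw samples in, one consistent estimate out, with invalid readings filtered before fusion; real systems replace the average with the algorithms discussed below.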
Figure 7: Multi-source data fusion process
How comprehensively and completely a system captures objects and the environment from multiple sensors is determined chiefly by the fusion algorithm, so the core issue of a multi-sensor system is selecting a suitable one. Information in such systems is diverse and complex, so the basic requirements for a fusion method are robustness, parallel processing capability, and computational speed and accuracy. Three commonly used data fusion approaches are briefly introduced below: Bayesian statistical theory, neural networks, and the Kalman filter.
Bayesian statistical theory
Figure 8: Venn diagram
The theorem was first published in 1763, in a posthumous paper by the British mathematician Thomas Bayes. It concerns the conditional probabilities of random events A and B. The "conditional probability" P(A|B) is the probability that event A occurs given that event B has occurred. From the Venn diagram above it is easy to see that P(A∩B) = P(A|B)·P(B) = P(B|A)·P(A), from which the conditional probability formula follows: P(A|B) = P(B|A)·P(A) / P(B). Here P(A) is called the prior probability: our assessment of the probability of event A before event B is observed.
As a simple example, traffic sign recognition (TSR) in the visual perception module is a crucial part of autonomous driving. During TSR, occlusion of the sign, for instance by trees or lampposts, is the main interference affecting recognition, so we care about the detection rate when the sign is obstructed. Define event A as correct recognition of the speed limit sign (its complement Ā as failure to recognize it), and event B as the sign being obstructed (its complement B̄ as the sign being unobstructed).
Figure 9: Obscured traffic speed limit sign
Based on the existing algorithm, the probability of correctly identifying a speed limit sign, event A, can be calculated; this is the prior probability. By reviewing the detection video recorded by the visual perception module, we can count how many of the detected speed limit signs were obscured and how many were not, and likewise for the missed signs. From these counts we obtain the following values:
Therefore, the probability of correctly recognizing a speed limit sign when it is obscured can be calculated:
So, some people might ask, what is the recognition rate if the speed limit sign is not obscured? Similarly, we can calculate that here as well:
The calculations above show that the recognition rate is quite high when the speed limit sign is fully visible, but much lower when it is blocked. Together, these two metrics are an important reference for evaluating how well the current image processing algorithm recognizes speed limit signs. The actual fusion process is of course far more complex, and XPeng Motors engineers are continuously optimizing recognition rates under all kinds of conditions to deliver a more comfortable intelligent driving assistance experience.
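Since the actual tallies are omitted above, here is the shape of the calculation with made-up counts; these numbers are hypothetical stand-ins, not XPeng measurement data:

```python
# Hypothetical tallies from reviewing detection video:
# A  = sign correctly recognized, Ā = sign missed
# B  = sign obscured,             B̄ = sign unobstructed
detected_obscured, detected_clear = 60, 850   # event A, split by B / B̄
missed_obscured,   missed_clear   = 40, 50    # event Ā, split by B / B̄

total = detected_obscured + detected_clear + missed_obscured + missed_clear

p_b       = (detected_obscured + missed_obscured) / total  # P(B)
p_a_and_b = detected_obscured / total                      # P(A∩B)

# Recognition rate given the sign is obscured: P(A|B) = P(A∩B)/P(B)
p_a_given_b = p_a_and_b / p_b
# Recognition rate given the sign is unobstructed: P(A|B̄)
p_a_given_not_b = detected_clear / (detected_clear + missed_clear)

print(f"P(recognized | obscured)     = {p_a_given_b:.2f}")      # → 0.60
print(f"P(recognized | unobstructed) = {p_a_given_not_b:.2f}")  # → 0.94
```

With these assumed counts, occlusion drops the recognition rate from about 94% to 60%, the kind of gap the text describes.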
Neural Network Theory
Figure 10: Neural Network
Artificial neural networks (ANNs) are a machine learning (ML) technique and a "hot topic" spanning artificial intelligence, cognitive science, neurophysiology, nonlinear dynamics, information science, and mathematics.
The development of ANNs has gone through three stages.
The first was the formative stage, with a new interdisciplinary field gradually taking shape from the 1940s onward. In 1943, neurophysiologist Warren McCulloch and mathematician Walter Pitts combined biophysics and mathematics to propose the first neural computational model, the MP model. In 1949, psychologist Donald Hebb, drawing on observations of brain neurons, learning, and conditioned reflexes, proposed the Hebb rule, a learning rule that adjusts the strength of neuronal connections and remains significant to this day.
The second was the development stage. In 1957, Rosenblatt extended the MP model and proposed the perceptron, gave the convergence theorem for the two-layer perceptron, and identified three-layer perceptrons with hidden processing elements as an important research direction. In 1960, Widrow proposed the Adaline (adaptive linear element) model and an effective network learning method, the Widrow-Hoff learning rule.
The third is the mature stage. In 1982, Caltech physicist John Hopfield proposed a new approach to associative memory and optimization computation, the Hopfield network, which produced a breakthrough in neural network research. In a 1984 paper, Hopfield showed that the network could be implemented with integrated circuits, making it easy for engineers and computer scientists to grasp and attracting widespread attention from the engineering community.
In the late 1980s, neural networks were overshadowed by the rise of conventional computing and the internet, but the growth of computing power in recent years has given them new opportunities. Neural networks are built from layers of neurons; the more layers, the deeper the network, and deep learning uses networks with many layers of neurons to perform machine learning. Geoffrey Hinton is widely regarded as a founding figure of deep learning: in 2006 he proposed an unsupervised, greedy layer-by-layer training algorithm based on deep belief networks (DBNs), offering hope for solving the optimization problems of deep architectures, and he later proposed the multi-layer autoencoder architecture. Today, deep neural network technology is widely used in computer vision, speech recognition, and natural language processing.
Research on neural networks encompasses numerous disciplines, including mathematics, computer science, artificial intelligence, microelectronics, automation, biology, physiology, anatomy, and cognitive science. These fields combine and permeate each other, mutually promoting the development of neural network research and applications.
Figure 11: An artificial nerve cell
Next, a brief introduction to the basics of neural networks. Just as a biological brain is composed of many nerve cells, an artificial neural network that simulates the brain is composed of many small structural modules called artificial neurons (also known as artificial nerve cells). An artificial neuron is a simplified version of a real nerve cell. In the diagram, the letters 'w' in the blue circles on the left are floating-point numbers called weights. Each input to the neuron is associated with a weight w, and together these weights determine the behavior of the whole network. For now, imagine that every weight is set to a random value between -1 and 1. Because weights can be positive or negative, they affect their inputs differently: a positive weight is excitatory, a negative weight inhibitory. When input signals enter a neuron, each value is multiplied by its corresponding weight, and the products become the inputs to the large circle in the diagram. The "kernel" of the large circle is a function called the activation function, which sums all these weighted inputs into a single activation value, itself a floating-point number that can be positive or negative. The neuron's output is then derived from the activation value: if it exceeds a threshold (say 1.0 in this example), the neuron outputs a 1; if it is below the threshold, it outputs a 0. This is the simplest kind of activation function for an artificial neuron.
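The neuron just described, a weighted sum followed by a threshold, fits in a few lines of Python. The weights below are hand-picked for illustration rather than learned:

```python
# Minimal artificial neuron: weighted sum of inputs followed by a
# step activation with threshold 1.0, outputting 0 or 1.

def neuron(inputs, weights, threshold=1.0):
    """Fire (return 1) if the weighted input sum exceeds the
    threshold, otherwise return 0."""
    activation = sum(x * w for x, w in zip(inputs, weights))
    return 1 if activation > threshold else 0

# Hand-picked weights in [-1, 1]: positive weights excite,
# negative weights inhibit.
weights = [0.8, -0.5, 0.9]

print(neuron([1.0, 0.0, 1.0], weights))  # 0.8 + 0.9 = 1.7 > 1.0 → 1
print(neuron([1.0, 1.0, 0.0], weights))  # 0.8 - 0.5 = 0.3 ≤ 1.0 → 0
```

Note how the inhibitory weight (-0.5) on the second input suppresses the neuron in the second call.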
Figure 12: Neural Network Structure
Biological nerve cells in the brain are interconnected with other nerve cells; to build an artificial neural network, artificial neurons must likewise be interconnected. There are many possible connection schemes, but the easiest to understand, and the most widely used, is the layer-by-layer arrangement shown in the diagram. This type of network is called a feedforward network, because each layer's outputs feed forward into the next layer (the one above it in the diagram) until the output of the whole network is produced.
Through connections between the input layer, hidden layers, and output layer, neurons form a complex network system. With effective learning and training, the outputs come ever closer to reality and the error shrinks; when the accuracy meets the functional requirements, training is complete. The trained network can then solve many machine learning problems, such as image recognition, speech recognition, and text recognition.
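Chaining the threshold neurons layer by layer gives a small feedforward network. The sketch below wires two inputs through a two-neuron hidden layer to one output neuron; the weights are hand-picked (a real network would learn them through training) and happen to implement XOR, a function a single neuron cannot compute:

```python
# A tiny feedforward network: each layer's outputs feed the next layer.
# Weights are hand-picked for illustration, not learned.

def step(x, threshold=1.0):
    return 1 if x > threshold else 0

def layer(inputs, weight_matrix):
    """One fully connected layer of step-activation neurons."""
    return [step(sum(x * w for x, w in zip(inputs, ws)))
            for ws in weight_matrix]

# 2 inputs -> hidden layer of 2 neurons -> 1 output neuron.
hidden_weights = [[0.7, 0.7],   # fires only when both inputs are 1 (AND)
                  [1.1, 1.1]]   # fires when either input is 1 (OR)
output_weights = [[-2.0, 1.5]]  # OR and not AND, i.e. XOR

def forward(inputs):
    return layer(layer(inputs, hidden_weights), output_weights)[0]

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((a, b), "->", forward([a, b]))  # prints the XOR truth table
```

The negative weight from the AND neuron inhibits the output when both inputs are active, which is exactly the step a single-layer network cannot express.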
In the current development of intelligent driving, artificial neural networks, and increasingly the latest deep learning techniques, are widely used for vehicle recognition, lane line recognition, and traffic sign recognition in the visual perception module. Collecting and processing data on Chinese road conditions, across weather (rain, snow, sunshine, etc.) and road types (urban roads, rural roads, highways, etc.), provides a reliable data foundation for deep learning. The network's input layer data, i.e., the data acquired by the sensors, is multi-source and multi-directional: the position, shape, and color of obstacles from the windshield-mounted visual perception module; the distance, angle, speed, and acceleration of obstacles from the millimeter-wave and ultrasonic radars; and parking-space and speed-bump data from the 360° surround-view system.
Kalman filtering
Kalman filtering is an algorithm that uses a linear system's state equations to optimally estimate the system state from observed input and output data. Put simply, the Kalman filter is an "optimal recursive data processing algorithm": when the measurement variance is known, it can estimate the state of a dynamic system from a series of noisy measurements. Because it is easy to implement in software and can update on field data in real time, Kalman filtering is the most widely applied filtering method, with successful applications in communication, navigation, guidance, control, and other fields.
Kalman filtering is a key method in multi-source sensor data fusion. To illustrate the principle concisely, consider fusing target position data from the millimeter-wave radar and the visual perception module. As a simple example, today's advanced driver assistance systems (ADAS) include millimeter-wave radar and ultrasonic radar modules, both of which can effectively estimate and identify the positions of obstacles and vehicles: radar is an active sensor that emits millimeter waves, receives the echoes from obstacles, and computes range and angle from the propagation time of the waves. Both sensors can locate a vehicle, so how do we fuse their information, select the best data, and compute a precise position? Kalman filtering is one way to solve this problem. The position data we acquire is always noisy; the Kalman filter uses the target's dynamics to suppress the influence of that noise and obtain a good estimate of the target's position. The estimate may be of the current position (filtering), a future position (prediction), or a past position (interpolation or smoothing). Kalman filtering is a dynamic, iterative process that predicts and estimates the target's detection state at the next moment from its current detected state.
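A one-dimensional sketch shows the predict/update loop at the heart of the filter. The scenario, noise values, and readings below are all illustrative assumptions (a stationary obstacle and a static-target motion model, the simplest possible case):

```python
# One-dimensional Kalman filter: iteratively refine an obstacle's
# range estimate from noisy measurements. Noise values are illustrative.

def kalman_1d(measurements, meas_var, process_var=1e-4):
    x, p = measurements[0], 1.0  # initial state estimate and variance
    estimates = [x]
    for z in measurements[1:]:
        # Predict: static-target model, so only uncertainty grows.
        p += process_var
        # Update: blend prediction and measurement via the Kalman gain.
        k = p / (p + meas_var)   # gain: how much to trust the measurement
        x += k * (z - x)
        p *= (1 - k)
        estimates.append(x)
    return estimates

# Hypothetical noisy radar range readings of an obstacle near 5.0 m.
readings = [5.3, 4.8, 5.1, 4.9, 5.2, 5.0]
est = kalman_1d(readings, meas_var=0.04)
print(f"final estimate: {est[-1]:.2f} m")
```

Each iteration the gain k shrinks as the state variance p falls, so later measurements nudge the estimate less and less, and the estimate settles near the true range despite the noise.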
Advanced driver assistance systems (ADAS) are a crucial direction in the development of intelligent vehicles, fusing multi-source sensor information to deliver stable, comfortable, and reliable driver assistance functions such as Lane Keeping Assist (LKA), Forward Collision Warning (FCW), Pedestrian Collision Warning (PCW), Traffic Sign Recognition (TSR), and Headway Monitoring and Warning (HMW). The purpose of fusing multi-source information is to exploit data redundancy as a basis for reliable analysis: improving accuracy, reducing false alarm and missed detection rates, and enabling the driver assistance system to self-check and self-learn, ultimately achieving the goals of intelligent, safe driving.