Autonomous driving perception layer
The most crucial component of a perception system is the sensor. Common sensor suites for autonomous driving include cameras, LiDAR, millimeter-wave radar, ultrasonic sensors, and a combined GNSS+IMU unit. Cameras provide rich color and texture information, making them suitable for recognizing traffic signs, lane lines, and traffic lights, but they are significantly affected by lighting conditions, rain, fog, and backlighting. LiDAR provides accurate 3D point clouds with high ranging accuracy and can clearly describe the shape of obstacles, but it is subject to noise in snow, rain, or on highly reflective surfaces; it is also costly and generates a large amount of data. Millimeter-wave radar has good penetration and can stably measure the relative velocity of objects, so it is often used to supplement cameras and LiDAR when their capabilities are limited in adverse weather. Ultrasonic sensors are used for short-range near-field detection, such as in automated parking. GNSS+IMU provides the vehicle's global position and short-term attitude, but it becomes inaccurate in tunnels, urban canyons, or underground parking lots; it is therefore usually combined with onboard vision/radar fine-grained positioning or high-precision maps. Understanding the strengths and weaknesses of each sensor is the first step in designing a sound perception system.
Transforming sensor data into usable information involves several stages. The first is data preprocessing, which includes time synchronization, coordinate transformation, and filtering. Time synchronization is fundamental but often underestimated, because cameras, radars, and IMUs sample at different frequencies and with different latencies; if the data isn't aligned, the perception module will "see" the same object in different locations. Coordinate transformation projects data from all sensors into a unified coordinate system (usually the vehicle frame or the world frame) to facilitate subsequent fusion. Noise removal, point cloud downsampling, image dehazing, and artifact filtering also fall under preprocessing, all aiming to reduce interference at the input.
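The coordinate-transformation step can be sketched with homogeneous 4x4 transforms. The mounting pose and point values below are illustrative assumptions, not calibration data from any real vehicle:

```python
import numpy as np

def make_transform(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def transform_points(points: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Apply a homogeneous transform to an (N, 3) point cloud."""
    homo = np.hstack([points, np.ones((points.shape[0], 1))])  # (N, 4)
    return (homo @ T.T)[:, :3]

# Assumed extrinsics: LiDAR mounted 1.5 m above the vehicle origin, no rotation.
T_vehicle_lidar = make_transform(np.eye(3), np.array([0.0, 0.0, 1.5]))
pts_lidar = np.array([[10.0, 0.0, -1.5]])      # a ground point, 10 m ahead
pts_vehicle = transform_points(pts_lidar, T_vehicle_lidar)
# In the vehicle frame the ground point lands at z = 0
```

In a real stack the rotation and translation come from an extrinsic calibration procedure, and the same transform chain (sensor → vehicle → world) is applied to every sensor before fusion.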
Following preprocessing, the next stages are detection, semantic segmentation, tracking, and localization. Detection and semantic segmentation typically rely on deep learning models to process camera pixels or laser point cloud features, aiming to turn "pixels/points" into representations of "objects." Point cloud detection employs a range of methods, from traditional voxel- or grid-based processing to more recent end-to-end networks built on point-based backbones, pillar encodings, or multi-view fusion. Camera detection commonly uses convolutional neural networks for object detection and segmentation, combined with geometric methods to estimate depth. The tracking module associates the detections of each frame into temporally continuous trajectories. A common approach is to run a Kalman filter or a more complex Bayesian filter on the detected states, while using the Hungarian algorithm or learned similarity metrics for data association. Tracking smooths detection noise, handles short-term occlusion, and provides historical trajectories for the prediction module.
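The Kalman-filter tracking step can be sketched for a single object with a constant-velocity motion model; the noise parameters and detections below are illustrative:

```python
import numpy as np

class CVKalmanTracker:
    """Constant-velocity Kalman filter for one tracked object.
    State: [x, y, vx, vy]; measurements are noisy (x, y) detections."""

    def __init__(self, x0, meas_noise=0.5, process_noise=0.1):
        self.x = np.array([x0[0], x0[1], 0.0, 0.0])
        self.P = np.eye(4)                       # state covariance
        self.R = np.eye(2) * meas_noise ** 2     # measurement noise
        self.q = process_noise

    def predict(self, dt):
        F = np.eye(4)
        F[0, 2] = F[1, 3] = dt                   # position += velocity * dt
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + np.eye(4) * self.q * dt
        return self.x[:2]

    def update(self, z):
        H = np.zeros((2, 4)); H[0, 0] = H[1, 1] = 1.0
        y = np.asarray(z) - H @ self.x           # innovation
        S = H @ self.P @ H.T + self.R
        K = self.P @ H.T @ np.linalg.inv(S)      # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ H) @ self.P
        return self.x[:2]

# Toy usage: detections of a target moving +1 m per frame along x.
trk = CVKalmanTracker((0.0, 0.0))
for z in [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]:
    trk.predict(dt=1.0)
    trk.update(z)
# trk.x[2] (estimated vx) climbs toward 1 m/s as evidence accumulates
```

A multi-object tracker would run one such filter per track and use the Hungarian algorithm (or a learned similarity) to associate each frame's detections to tracks before calling `update`.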
Another crucial aspect of perception is prediction. Prediction here doesn't mean "completely foretelling the future," but rather providing a distribution over possible futures, with uncertainty attached. The simplest prediction assumes the target moves at constant velocity or constant acceleration. On real roads, however, pedestrians, cyclists, and other vehicles exhibit complex steering, braking, and interactive behaviors, so practical designs often combine physical models, trajectory-based heuristics, and learned models. Learning methods include recurrent networks, graph neural networks, and Transformers, which can learn the typical trajectory patterns of traffic participants in different semantic scenarios. More advanced interactive prediction considers the possible reactions of other agents, outputting conditional predictions such as "if I slow down, the other party has an x% probability of doing y." The key point is that the output of the perception module is usually not a single ground-truth value, but a state estimate plus uncertainty (probability or confidence). This uncertainty information must be passed on to the decision-making layer; otherwise, decisions will overestimate the accuracy of perception.
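The simplest case, a constant-velocity rollout with explicit uncertainty, can be sketched as follows; the pedestrian state and variances are illustrative assumptions:

```python
import numpy as np

def predict_cv(pos, vel, pos_var, vel_var, horizon, dt):
    """Constant-velocity prediction that propagates uncertainty:
    var[x + t*v] = var[x] + t^2 * var[v], so the position variance
    grows quadratically with the prediction horizon."""
    t = np.arange(1, horizon + 1) * dt                 # t = dt, 2*dt, ...
    positions = pos[None, :] + t[:, None] * vel[None, :]
    variances = pos_var + t ** 2 * vel_var
    return positions, variances

# A pedestrian at (2, 0) walking 1.2 m/s along x, predicted 3 s ahead.
positions, variances = predict_cv(np.array([2.0, 0.0]), np.array([1.2, 0.0]),
                                  pos_var=0.04, vel_var=0.09, horizon=6, dt=0.5)
# variances[-1] > variances[0]: far-future positions carry more uncertainty
```

This is exactly the "distribution over the future" the text describes: downstream decision-making should treat the wide far-horizon variance as a reason for conservatism, not discard it.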
Modern perception systems increasingly rely on deep learning, and in practical engineering, geometric algorithms and learning methods are often combined to balance efficiency and interpretability. For example, traditional point cloud registration algorithms perform coarse alignment before the aligned data is fed into a learning model for semantic extraction; or depth is recovered from multi-view camera geometry before a learned network performs detection. The advantage of this approach is that physical priors constrain the learning model, reduce reliance on massive amounts of labeled data, and keep the system's behavior interpretable even in extreme scenarios. Another practical challenge is computational resources and latency. Point clouds and high-resolution camera data are enormous, and perception modules need to run in real time (typically 10–30 Hz) on limited compute. This has prompted the use of model pruning, distillation, quantization, and hardware acceleration (GPU/TPU/dedicated accelerators) to compress inference time.
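Of the compression techniques mentioned, quantization is the easiest to illustrate. A minimal sketch of symmetric per-tensor int8 quantization, using random weights as a stand-in for a real model layer:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: the largest magnitude in the
    tensor is mapped to 127; dequantizing lets us inspect the rounding error."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

# Stand-in for one layer's weights (a real model would quantize every layer).
w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale          # dequantized reconstruction
max_err = np.max(np.abs(w - w_hat))           # bounded by scale / 2
```

Storing `q` instead of `w` cuts memory 4x versus float32 and enables int8 matrix kernels; production schemes additionally calibrate activation ranges, which this sketch omits.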
One easily overlooked but crucial point is the error patterns of perception. Occlusion, backlighting, strong reflections, road-surface mirroring, unusual-looking objects such as fabric debris or traffic cones, and echoes from rain and snow particles can all cause perception errors. Therefore, the system must account for uncertainty propagation at the design stage. For example, when perception reports a pedestrian with a confidence of only 0.6, the decision-making layer should be conservative; when localization against the high-precision map is unstable, the vehicle should reduce speed and enter a safe mode as soon as possible. Perception does not simply output "where the object is"; it outputs the state, confidence, and key semantic information (such as "is it a vehicle or a bicycle") together, for the decision-making layer to weigh comprehensively.
Autonomous driving decision layer
Autonomous driving decision-making typically follows a layered approach: from high-level route planning to behavior planning, and finally to local motion planning. Route planning answers "which roads to take from point A to point B," similar to traditional navigation, and relies on road network maps and traffic rules. Behavior planning answers "what driving maneuver should I perform on the current road segment," such as changing lanes, overtaking, or making a U-turn. Local motion planning translates the chosen behavior into a continuous space-time trajectory that satisfies dynamic constraints. Separating these three layers helps decompose a complex problem, but attention must be paid to their interaction and consistency: higher-level decisions commit to hard-to-reverse actions (such as entering a ramp), and once such a decision is made, local planning and control must reliably carry it out while handling the reactions of other road users.
There are various methods for implementing behavioral planning. Early engineering practices focused on rules and state machines, such as hardcoding traffic rules, courtesy strategies, and some empirical rules, and implementing the behavioral logic using finite state machines. This approach is interpretable and easy to verify, but it struggles to cover complex and ever-changing scenarios. In recent years, cost function-based optimization methods and learning strategies have been extensively studied. Cost function methods define a set of objectives and constraints, such as safe distance, comfort, time efficiency, and deviation from traffic rules, and then find the minimum cost solution among a set of candidate behaviors or trajectories. Reinforcement learning or imitation learning methods attempt to learn "how to make decisions" directly from data, and can learn more flexible driving strategies in certain scenarios, but their interpretability and verifiability remain challenges. Therefore, in practical products, a hybrid strategy is often used, where rules are used as hard constraints, and the learning model or optimizer is used as a soft policy decision-maker, with a fallback to the rule baseline when there is significant uncertainty.
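The hybrid pattern described above, hard rules filtering candidates before a soft cost picks among survivors, can be sketched as follows; the cost terms, weights, and candidate numbers are all illustrative assumptions:

```python
def behavior_cost(c, w_safety=10.0, w_comfort=1.0, w_time=2.0):
    """Weighted soft cost of a candidate behavior (weights are illustrative).
    c: dict with min_gap (m), max_jerk (m/s^3), eta (s)."""
    safety = max(0.0, 2.0 - c["min_gap"]) ** 2     # penalize gaps under 2 m
    comfort = c["max_jerk"] ** 2
    return w_safety * safety + w_comfort * comfort + w_time * c["eta"]

candidates = {
    "keep_lane":   {"min_gap": 8.0, "max_jerk": 0.1, "eta": 12.0},
    "change_left": {"min_gap": 1.5, "max_jerk": 0.8, "eta": 9.0},
}

# Hard rule first (a minimum gap), then the cheapest soft cost among survivors.
feasible = {k: c for k, c in candidates.items() if c["min_gap"] >= 2.0}
best = min(feasible, key=lambda k: behavior_cost(feasible[k]))
# best → "keep_lane": the lane change scores a lower soft cost,
# but is rejected outright by the hard gap constraint
```

Note that the lane change would actually win on soft cost alone; it is the hard constraint, acting as the rule baseline, that forces the conservative choice, which is precisely the fallback behavior the text describes.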
Local motion planning serves as the bridge between decision-making and control. Given a chosen behavior, it generates a time-parameterized trajectory that the controller can track, typically accounting for vehicle dynamics, road curvature, obstacle locations, and the predicted trajectories of other agents. Trajectories can be generated by sampling-and-evaluation methods or by optimization-based solvers. Sampling methods first generate many feasible candidate trajectories, then evaluate them with a cost function and select the best; this is simple to implement and easily parallelized, but the results depend on sample coverage. Optimization methods (such as gradient-based trajectory optimization or receding-horizon formulations in the style of model predictive control) solve for the trajectory directly as a continuous optimization problem, allowing more refined handling of constraints but requiring significant computation. Some systems combine a fast sampler with local optimization: candidate trajectories are generated first (ensuring real-time performance), then quickly refined by local optimization to improve feasibility and comfort.
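The sampling-and-evaluation idea can be sketched in its simplest form: candidates are straight paths at fixed lateral offsets, scored by a cost that trades lane-keeping against obstacle clearance. Speeds, weights, and thresholds are illustrative assumptions:

```python
import numpy as np

def sample_and_score(offsets, obstacle, v=5.0, horizon=20, dt=0.1,
                     w_track=1.0, w_clear=50.0):
    """Toy sampling planner: each candidate is a straight path at a fixed
    lateral offset; colliding samples are rejected outright (hard constraint),
    the rest compete on a soft cost."""
    t = np.arange(1, horizon + 1) * dt
    best_offset, best_cost = None, np.inf
    for d in offsets:
        xs, ys = v * t, np.full_like(t, d)
        clearance = np.min(np.hypot(xs - obstacle[0], ys - obstacle[1]))
        if clearance < 0.5:                  # hard constraint: reject near-collision
            continue
        cost = w_track * d ** 2 + w_clear / clearance
        if cost < best_cost:
            best_offset, best_cost = d, cost
    return best_offset

# An obstacle in the lane center 5 m ahead: the planner picks a sideways offset.
choice = sample_and_score(offsets=[-1.0, -0.5, 0.0, 0.5, 1.0], obstacle=(5.0, 0.0))
```

Real samplers draw smooth polynomial or spline candidates in a road-aligned (Frenet) frame rather than straight lines, but the generate-reject-score-select loop is the same.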
Another key challenge in the decision-making module is interaction. Other actors on the road will react to your actions, so a good decision requires not only predicting how others will move, but also evaluating their reactions to your actions. Interactive decision-making can leverage game theory, multi-agent reinforcement learning, or conditional prediction models, but these methods are complex and difficult to guarantee convergence to a verifiable policy. Therefore, engineering implementations often opt for a compromise: using a prediction model to weight and evaluate candidate trajectories in normal scenarios, and slowing down or conservatively avoiding paths when risk increases or prediction uncertainty is high, thus ensuring safety without sacrificing too much efficiency.
Decision-making must also align with regulations, ethics, and functional safety standards. Different countries and regions have different legal constraints on autonomous driving behavior, such as clear regulations on lane changing etiquette, traffic light handling, and speed limits in school zones. Decision-making systems must embed these hard constraints into their strategies. For example, ISO 26262 requires risk assessments of potential hazard-causing failure modes and the design of redundancy/failure response mechanisms. From a technical perspective, a common practice is to add a safety envelope to the decision-making module. This envelope acts as a final check after local planning output, ensuring that the trajectory remains within any known safety constraints, such as not exceeding minimum braking distance or maintaining lateral stability limits. If a violation is detected, the trajectory is rejected, and emergency deceleration or stopping is triggered.
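A safety envelope of the kind described can be sketched as a final check on each planned state; the deceleration capability, lateral limit, and margin below are illustrative, not values from any standard:

```python
def envelope_ok(v, gap, a_lat, decel_max=6.0, a_lat_max=3.0, margin=2.0):
    """Final safety-envelope check on a planned state (limits are illustrative).
    v: speed (m/s); gap: distance to nearest obstacle ahead (m);
    a_lat: commanded lateral acceleration (m/s^2)."""
    braking_distance = v ** 2 / (2.0 * decel_max)
    if gap < braking_distance + margin:
        return False                     # cannot stop in time -> reject trajectory
    if abs(a_lat) > a_lat_max:
        return False                     # beyond the lateral stability limit
    return True

# At 20 m/s the stopping distance is ~33.3 m plus margin:
ok = envelope_ok(20.0, 40.0, a_lat=1.0)    # True: 40 m gap clears the envelope
bad = envelope_ok(20.0, 30.0, a_lat=1.0)   # False: gap inside stopping envelope
```

When the check fails, the system does exactly what the text prescribes: the trajectory is rejected and an emergency deceleration or stop is triggered instead.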
Autonomous driving control layer
The output of the decision layer is a desired trajectory or velocity profile, which is sent to the control layer for execution. The task of control is to turn the planned trajectory into the actual motion of the vehicle. Control has two main dimensions: longitudinal (speed, acceleration, and braking) and lateral (steering). Some systems combine both into a single joint controller, while others design separate longitudinal and lateral controllers and then coordinate them. Real-world issues the controller must handle include actuator nonlinearity (the nonlinear relationship between accelerator pedal position and propulsion, and between brake pedal position and braking force), actuator delay (the lag between a pedal command and the resulting acceleration), and changes in vehicle dynamics due to load and tire condition.
Common techniques in longitudinal control include PID control and model predictive control (MPC). PID is simple, reliable, and has mature tuning practice, making it suitable for routine cruising. MPC is better suited to situations requiring dynamic constraints and predictive information, because it can introduce constraints directly into the optimization (such as maximum acceleration and passenger comfort weights) and generate a control sequence over a future horizon. However, MPC is computationally demanding, requiring the optimization problem to be solved in real time. Common algorithms for lateral control include geometric methods (such as pure pursuit), state-error-based controllers (such as the Stanley controller), and MPC. Pure geometric methods are simple to implement and perform well at low speeds, while MPC is more robust at high speeds or under dynamically coupled conditions.
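A minimal PID speed controller can be sketched as follows, with output clamping and a simple anti-windup; the gains and the idealized vehicle model (acceleration applied instantly, no lag) are illustrative assumptions:

```python
class PID:
    """PID speed controller with output clamping and anti-windup.
    Gains are illustrative, not tuned for any real vehicle."""

    def __init__(self, kp, ki, kd, out_min=-3.0, out_max=2.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_err = None

    def step(self, target, measured, dt):
        err = target - measured
        d = 0.0 if self.prev_err is None else (err - self.prev_err) / dt
        self.prev_err = err
        u = self.kp * err + self.ki * self.integral + self.kd * d
        if self.out_min < u < self.out_max:
            self.integral += err * dt    # anti-windup: integrate only unsaturated
        return max(self.out_min, min(self.out_max, u))

# Toy closed loop: accelerate an idealized vehicle from 0 to 20 m/s.
pid = PID(kp=0.8, ki=0.1, kd=0.05)
v, dt = 0.0, 0.1
for _ in range(300):
    a = pid.step(target=20.0, measured=v, dt=dt)   # commanded acceleration
    v += a * dt                                    # no actuator lag in this model
# v converges toward the 20 m/s target
```

A real longitudinal controller would add the actuator delay and pedal-map nonlinearity discussed earlier between the commanded and realized acceleration, which is where PID alone starts to struggle and MPC becomes attractive.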
Control design involves more than just choosing an algorithm; it also includes feedforward and feedback structures, delay compensation, and nonlinear compensation for actuator and tire models. Feedforward control uses the curvature of the planned trajectory to provide the expected steering angle, reducing steady-state error; feedback control corrects for actual errors. Delay compensation can reduce deviations caused by leading or lagging during actual execution by predicting future vehicle states or incorporating known delays into the trajectory during planning. More complex systems may incorporate tire mechanics and vehicle yaw dynamics models into the controller to improve stability under extreme conditions.
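The feedforward-plus-feedback structure for steering can be sketched with a kinematic bicycle model; the wheelbase and gains are illustrative assumptions:

```python
import math

def steering_command(curvature, lateral_err, heading_err,
                     wheelbase=2.8, k_lat=0.3, k_head=1.0):
    """Feedforward + feedback steering sketch (kinematic bicycle assumptions).
    Feedforward: the steering angle that would follow the path curvature
    exactly; feedback: proportional corrections on lateral and heading error."""
    ff = math.atan(wheelbase * curvature)          # curvature feedforward
    fb = k_lat * lateral_err + k_head * heading_err
    return ff + fb

# On a straight path with zero error the command is zero;
# entering a curve, the feedforward term supplies steering before any error builds.
delta = steering_command(curvature=0.01, lateral_err=0.0, heading_err=0.0)
```

The division of labor matches the text: the feedforward term removes most of the steady-state error on curves, so the feedback gains can stay small and the loop stays smooth.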
The controller also needs to work in conjunction with the vehicle's underlying electronic stability system (ESP/ESC), anti-lock braking system (ABS), etc. During emergency braking, the distribution of braking force, the activation of ABS, and the lateral force control of ESC all affect the actual behavior of the vehicle. Therefore, the control system usually needs to be tightly integrated with the underlying vehicle control unit (VCU), or allow the underlying vehicle safety system to prioritize vehicle stability through a low-level interface.
The necessity of combining the three
Combining the perception, decision-making, and control layers makes the architecture look like a clean pipeline, but in engineering practice multiple redundant paths are designed to run in parallel. To ensure functional safety, common practices include sensor redundancy (duplicate placement of sensors of the same or different types), computational redundancy (a main controller and backup controllers running separately with mutual monitoring), and software redundancy (the main algorithm checked in parallel by simplified hard rules). A low-latency safety channel (watchdog) also intervenes immediately if the main pipeline fails.
Testing and validation are the most resource-intensive yet critical parts of the entire system development process. Perception requires a large amount of labeled data covering different weather conditions, lighting, road materials, and rare events. Relying solely on large-scale road testing is both expensive and difficult to cover all aspects; therefore, industry widely uses simulation platforms for massive-scale scenario testing, including programmable traffic participant behavior simulation, extreme weather simulation, and sensor physics simulation. Decision-making requires long-tail testing in interactive scenarios to evaluate the robustness of strategies in rare scenarios. Motion planning and control require closed-loop simulation combined with real vehicle road testing, especially verifying tracking errors and stability under boundary conditions. Hardware-in-the-loop (HIL) and vehicle-in-the-loop (VIL) testing are important means of coupling software with real actuators and electronic control units to verify latency and nonlinear effects. Of course, final road validation (including test tracks and public roads) remains an indispensable step, but the test coverage should be selectively expanded based on the aforementioned multi-level simulation and operational condition validation.
In summary, perception, decision-making, and control in autonomous driving constitute a highly coupled systems engineering project. Perception is responsible for transforming complex, noisy, and uncertain external information into internal states with semantic and probabilistic descriptions; decision-making makes policy choices based on these states and generates trajectories that satisfy dynamic and safety constraints; control precisely executes the trajectory onto the vehicle while addressing actuator nonlinearity and latency. Each layer has mature paths based on classical algorithms, and is also being rapidly pushed to its limits by learning-based methods. However, the most important factors for the practical application of autonomous driving remain robustness, verifiability, and safety. Understanding the division of labor among the three layers and explicitly handling uncertainties at the interfaces are key to transforming research results into reliable products.