First, it's important to understand that cameras and LiDAR operate on fundamentally different principles. Cameras acquire two-dimensional color images through optical lenses and image sensors, recording information such as color, texture, and lighting of the scene. LiDAR, on the other hand, emits laser pulses and measures the time-of-flight (TOF) of the light pulses from emission to reception, directly calculating the distance to objects and generating a high-precision, three-dimensional point cloud. The dimensions and nature of the information they acquire also differ. Cameras excel at extracting texture and semantics but lack direct physical depth measurement capabilities; LiDAR measures distances with millimeter to centimeter precision but lacks color and detailed texture information.
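The time-of-flight principle is simple enough to state in one line of arithmetic: the pulse travels to the target and back, so the distance is half the round-trip time multiplied by the speed of light. A minimal sketch (real receivers also handle pulse detection, multiple returns, and noise):

```python
C = 299_792_458.0  # speed of light in m/s

def tof_distance(round_trip_seconds: float) -> float:
    """Distance to the target from a pulse's round-trip travel time."""
    return C * round_trip_seconds / 2.0

# A pulse returning after ~667 ns corresponds to a target roughly 100 m away.
print(tof_distance(667e-9))
```

Because the measurement is a direct physical quantity, sub-nanosecond timing resolution translates into centimeter-level ranging precision.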
To achieve 3D spatial reconstruction, cameras must rely on binocular stereo vision or deep learning algorithms to infer depth. Binocular vision uses the parallax (disparity) between the left and right cameras for triangulation, which is accurate at short range. However, as the target distance increases, the disparity shrinks and the depth error amplifies rapidly. Furthermore, texture-poor surfaces, direct sunlight, and shadowed areas can all cause feature matching to fail, increasing the depth error further. While deep learning models for monocular depth estimation perform well on some public datasets, they essentially rely on statistical inference: if the training data differs from the actual driving scenario, misjudgments occur. Moreover, monocular networks can only output relative depth; they must be combined with other information, such as odometry, to recover absolute scale, and this external information itself introduces additional error.
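The rapid error amplification with distance follows directly from the stereo triangulation formula Z = f·B/d (focal length f in pixels, baseline B in meters, disparity d in pixels): differentiating shows a fixed disparity matching error dd produces a depth error that grows with the square of the distance, dZ ≈ Z²/(f·B)·dd. The camera parameters below are illustrative assumptions, not values from the text:

```python
def stereo_depth(f_px: float, baseline_m: float, disparity_px: float) -> float:
    """Triangulated depth from disparity: Z = f * B / d."""
    return f_px * baseline_m / disparity_px

def depth_error(f_px: float, baseline_m: float, z_m: float, dd_px: float) -> float:
    """Depth error from a disparity matching error dd: dZ ~ Z^2 / (f*B) * dd."""
    return z_m ** 2 / (f_px * baseline_m) * dd_px

# Hypothetical camera: 1000 px focal length, 12 cm baseline, half-pixel matching error.
f, B, dd = 1000.0, 0.12, 0.5
for z in (5.0, 20.0, 80.0):
    print(f"Z = {z:5.1f} m -> depth error ~ {depth_error(f, B, z, dd):.2f} m")
```

With these numbers the error grows from roughly a decimeter at 5 m to tens of meters at 80 m, which is why stereo ranging degrades so quickly at highway distances.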
Meanwhile, cameras are extremely sensitive to lighting conditions. At night, in tunnels, or when driving against direct sunlight, images are prone to noise or overexposure, severely impacting the accuracy of target detection and tracking algorithms. Even adding infrared illumination or high-sensitivity sensors increases system cost and power consumption. In contrast, LiDAR is almost unaffected by visible light and can operate normally in low light or even complete darkness, ensuring accurate distance measurement at night. Furthermore, severe weather significantly impacts cameras. Fog scatters visible light, drastically reducing image contrast and blurring outlines; heavy rain causes image distortion due to raindrops adhering to the lens; and snow can obscure lane lines and obstacles. While image restoration algorithms such as defogging and deraining can alleviate these problems to some extent, restoring the image to a flawless state is extremely difficult in real high-speed driving conditions. LiDAR is also affected by water droplets and snowflakes in rainy and snowy weather, but through multi-pulse filtering, intensity suppression, and hardware optimization, it can filter out clutter and maintain ranging stability to some extent.
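The clutter filtering mentioned above can be illustrated with a toy heuristic: rain and snow typically produce weak, isolated echoes, so one common approach is to drop returns whose reflected intensity falls below a threshold and, for multi-return pulses, discard intermediate returns that the pulse partially passed through. The field names and threshold here are illustrative assumptions, not any vendor's actual pipeline:

```python
def filter_weather_clutter(points, min_intensity=0.15):
    """Keep points likely to come from solid surfaces.

    points: list of dicts with keys 'xyz', 'intensity' (normalized 0-1),
    'return_index' (1-based), and 'num_returns' for the emitting pulse.
    """
    kept = []
    for p in points:
        if p["intensity"] < min_intensity:
            continue  # weak echo: likely a droplet or snowflake
        if p["num_returns"] > 1 and p["return_index"] < p["num_returns"]:
            continue  # intermediate return: pulse partially penetrated the target
        kept.append(p)
    return kept
```

Production filters are more sophisticated (statistical outlier removal, range-dependent thresholds), but the principle is the same: clutter can be rejected quantitatively because each point carries physical range and intensity measurements.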
Undeniably, cameras possess advantages in color and texture, enabling them to perform tasks such as pixel-level semantic segmentation, traffic sign recognition, and lane detection. Deep learning-based semantic segmentation networks can accurately distinguish between different categories of information, including pedestrians, vehicles, and buildings, providing rich context for autonomous driving decisions. LiDAR, on the other hand, only provides sparse point clouds and lacks color information, requiring classification and segmentation through point cloud deep learning algorithms. While less intuitive than images, significant progress in point cloud deep learning technology in recent years has continuously improved the performance of LiDAR in semantic segmentation.
Currently, autonomous driving systems often employ multi-sensor fusion, tightly coupling LiDAR and camera data to leverage their complementary strengths. For example, projecting point clouds onto an image plane, using deep learning networks for semantic segmentation, and then registering the image with LiDAR point clouds allows for the simultaneous acquisition of high-precision 3D geometric information and rich semantic labels. This way, even in low-visibility conditions at night, LiDAR can supplement depth information; in traffic sign recognition scenarios, the high-resolution color images from cameras make it easier to identify sign details. Relying solely on pure vision for 3D ranging and semantic understanding can lead to perception system malfunctions in the event of sudden changes in lighting or occlusion, thereby jeopardizing driving safety.
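The projection step at the heart of this fusion can be sketched with the standard pinhole model: transform LiDAR points into the camera frame using the extrinsic rotation R and translation t, then project with the intrinsic matrix K. The matrices below are illustrative placeholders; real systems obtain them from calibration:

```python
import numpy as np

def project_points(points_lidar, K, R, t):
    """Project (N,3) LiDAR-frame points into pixel coordinates.

    Returns (M,2) pixel coordinates and the indices of the points that lie
    in front of the camera (points behind the lens cannot be projected).
    """
    pts_cam = points_lidar @ R.T + t    # LiDAR frame -> camera frame
    in_front = pts_cam[:, 2] > 0.1      # keep points ahead of the lens
    pts_cam = pts_cam[in_front]
    uvw = pts_cam @ K.T                 # pinhole projection to homogeneous pixels
    uv = uvw[:, :2] / uvw[:, 2:3]       # perspective divide
    return uv, np.flatnonzero(in_front)

# Hypothetical calibration: 800 px focal length, 1280x720 principal point,
# identity extrinsics for the sketch.
K = np.array([[800.0, 0.0, 640.0], [0.0, 800.0, 360.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
pts = np.array([[0.0, 0.0, 10.0], [1.0, -0.5, 20.0]])
uv, idx = project_points(pts, K, R, t)
```

Each projected point can then be tagged with the semantic label of the pixel it lands on, attaching class information to geometrically accurate 3D measurements.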
The reason for researching pure vision-based systems is primarily cost. Cameras cost only a few hundred to a few thousand RMB, while high-precision multi-beam LiDAR can easily cost tens or even hundreds of thousands of RMB. Many automakers have attempted to reduce sensor costs through pure vision solutions, but to meet the safety redundancy and regulatory compliance requirements of autonomous driving systems, they must opt for higher-resolution, higher-sensitivity industrial-grade cameras or add infrared auxiliary equipment, the cost of which is already close to or exceeds that of low-end LiDAR. Extracting and inferring depth information and running complex image algorithms also require a more powerful computing platform, significantly increasing compute cost and power consumption. In contrast, the point cloud data output by LiDAR is itself a geometric physical quantity, the backend processing chain is relatively simple, and the computing power requirement is lower. Overall, LiDAR may not be as prohibitively expensive as imagined.
In terms of reliability and redundancy design, LiDAR also performs better. Leading LiDAR manufacturers such as Velodyne, Innoviz, and Ouster continuously optimize hardware and heat dissipation structures to ensure stable performance in harsh environments such as high and low temperatures, vibration, rain, and snow. When cameras experience extreme temperatures or severe vibrations, the lens may experience focus drift, image blurring, or sensor noise, affecting image quality and algorithm output. If a camera fails or its performance degrades significantly, redundancy must be ensured by other sensors, and LiDAR is the most reliable backup sensor. If LiDAR is abandoned and only a combination of camera and millimeter-wave radar is used, blind spots still exist when detecting small targets at a distance (such as pedestrians and cyclists); millimeter-wave radar has lower resolution and cannot accurately distinguish the fine contours of nearby obstacles, let alone generate high-precision 3D maps.
LiDAR also boasts significant advantages in high-precision map building and real-time positioning. Dense 3D point clouds can be directly used to construct high-precision maps, recording static environmental features such as roadside guardrails, curbs, and buildings, providing reliable references for vehicle positioning. While visual SLAM (Simultaneous Localization and Mapping) technology continues to advance, feature point extraction and tracking are prone to failure in scenarios with drastic lighting changes, repetitive textures, or low-light conditions, leading to positioning drift. LiDAR SLAM, based on high-precision distance measurement, can achieve stable positioning even at night or in low-light environments, exhibiting higher overall robustness. To build high-precision maps comparable to LiDAR using a purely visual solution requires massive investment in calibration, manual correction, and algorithm development, significantly increasing cost and complexity.
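The scan matching at the core of LiDAR SLAM can be sketched as a minimal 2D point-to-point ICP (iterative closest point) loop: repeatedly match each point to its nearest neighbor in the reference scan, then solve for the rigid transform that best aligns the matches via the SVD-based Kabsch method. A real SLAM pipeline adds kd-trees, outlier rejection, and pose-graph optimization; this only shows the core idea, with a synthetic toy scan as the example:

```python
import numpy as np

def icp_2d(src, ref, iters=20):
    """Estimate the rigid transform (R, t) that aligns scan src onto ref."""
    src = src.copy()
    R_total, t_total = np.eye(2), np.zeros(2)
    for _ in range(iters):
        # Brute-force nearest neighbors (fine for toy-sized scans).
        d2 = ((src[:, None, :] - ref[None, :, :]) ** 2).sum(-1)
        matched = ref[d2.argmin(axis=1)]
        # Kabsch: best rigid transform between matched centroids.
        mu_s, mu_m = src.mean(0), matched.mean(0)
        H = (src - mu_s).T @ (matched - mu_m)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:  # guard against reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_m - R @ mu_s
        src = src @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total

# Toy example: a reference "scan" and a slightly rotated, shifted copy of it.
rng = np.random.default_rng(0)
ref = rng.uniform(0.0, 10.0, (25, 2))
theta = 0.02
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
t_true = np.array([0.1, -0.05])
src = (ref - t_true) @ R_true       # so that src @ R_true.T + t_true == ref
R_est, t_est = icp_2d(src, ref)
```

Because the alignment is driven by measured distances rather than appearance, the same loop works identically in darkness, which is precisely the robustness advantage described above.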
Of course, in some applications, such as automated warehouse handling, campus inspection, or low-speed robotaxi operation, where speeds are low and the scenario is controllable, a pure vision solution combined with millimeter-wave radar or ultrasonic sensors can deliver reasonably robust perception at lower cost. However, on highways, on high-density urban roads, or in changeable weather, cameras alone cannot ensure adequate safety. The high-precision, high-frame-rate 3D point clouds provided by LiDAR reduce speed and distance measurement errors, giving the system more reaction time and significantly improving driving safety.
With technological advancements, LiDAR is rapidly iterating towards miniaturization, lower cost, and higher precision. Solid-state LiDAR achieves beam scanning without mechanical rotation through silicon photonics or MEMS micromirrors, resulting in continuously decreasing costs and size while increasing reliability. As production scales up, LiDAR prices are expected to gradually approach affordable levels, further narrowing the cost gap with cameras. However, achieving LiDAR-level ranging and robustness performance for pure vision in all driving scenarios requires significant breakthroughs in both algorithms and hardware, which are unlikely to be realized in the short term.
From an algorithmic perspective, deep learning can train networks on massive amounts of data to extract image features and perform deep inference based on visual content. However, this is ultimately an empirical perception, lacking the interpretability and determinism of physical measurement. When encountering scenarios such as unfamiliar roads, unusual building appearances, or novel obstacles under different weather conditions that are not covered by training data, pure vision systems may experience blind spots or misjudgments. LiDAR's point cloud output represents true geometric distances, and noise and errors can be quantitatively processed during the filtering stage, resulting in greater interpretability and providing a more stable input for the decision-making module.
For many consumers, the presence of LiDAR signals that an autonomous vehicle is safer. Users are more likely to trust a vehicle's perception capabilities when they see LiDAR mounted on the roof or near the windshield. While pure vision solutions have performed well in demonstrations, user concerns about relying solely on cameras for distance measurement remain. For near-term commercialization, LiDAR is therefore not only a technology choice but also a symbol of brand and safety commitment.
In summary, although pure vision perception technology has made significant progress in object detection, semantic segmentation, and depth estimation, and has certain cost advantages, it cannot completely replace LiDAR due to its inherent limitations, such as the inability to obtain high-precision physical distance, sensitivity to lighting and weather conditions, high computational dependence, and insufficient interpretability. LiDAR, with its high precision, robustness, and good environmental adaptability, remains the core sensor in autonomous driving perception systems. The optimal solution for the future remains the cross-modal fusion of cameras with LiDAR, millimeter-wave radar, and other sensors to build a multi-redundant, multi-dimensional, all-scenario perception system, providing a higher level of safety and intelligent driving experience for autonomous driving.