Pure vision-based autonomous driving refers to an autonomous driving system that relies solely on cameras, without active sensors such as LiDAR or millimeter-wave radar. Tesla pioneered this approach, arguing that human driving can be replicated through vision alone. For a period, Chinese automakers also enthusiastically embraced pure vision solutions. By 2025, however, the hype had gradually faded, and with the industry's current strong emphasis on "safety first in intelligent driving," the approach's advantages seem less pronounced. So what safety issues might pure vision-based autonomous driving bring? That is the topic at the forefront of intelligent driving we'll discuss today.
Perceptual limitations
As passive sensors, cameras are highly susceptible to changes in lighting and weather. In rain, snow, fog, or haze, the images they capture are prone to blurring or reduced contrast, causing a significant decline in perception performance. Insufficient light at night and backlit scenes likewise severely limit how much information a camera can acquire. These conditions readily produce blind spots or false detections: several Tesla Autopilot incidents occurred because the system could not distinguish a white truck trailer from the bright sky behind it and failed to react.
Furthermore, pure vision systems must infer three-dimensional spatial information from two-dimensional images. Vehicles therefore rely on multiple cameras and complex algorithms to estimate distance and shape, but this "2D-to-3D" process has inherent limitations: depth and velocity must be extracted from image features, and the resulting latency and errors are difficult to eliminate entirely. Tesla's cameras, for example, cannot directly measure the depth or velocity of objects. For safety reasons, after removing millimeter-wave radar, Tesla initially capped the maximum speed of Autosteer at 120 km/h and increased the following distance, only later relaxing these limits slightly. Camera-only solutions are evidently weaker at measuring distance, depth, and velocity, and cannot match the direct measurements provided by LiDAR or millimeter-wave radar.
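To make the geometry concrete, here is a minimal sketch of the classic stereo depth-from-disparity relation. The focal length, baseline, and pixel-noise figures are assumed values chosen purely for illustration; the point is that range uncertainty grows roughly with the square of distance, which is one reason camera-derived distances degrade at long range.

```python
# Illustrative only: stereo depth-from-disparity math showing why camera-based
# range estimates degrade with distance. Focal length, baseline, and
# disparity-noise values below are hypothetical.

def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Pinhole stereo relation: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px

def depth_error(focal_px: float, baseline_m: float, depth_m: float,
                disparity_noise_px: float) -> float:
    """First-order error propagation: dZ ~= Z^2 / (f * B) * dd,
    i.e. range error grows roughly with the square of distance."""
    return (depth_m ** 2) / (focal_px * baseline_m) * disparity_noise_px

if __name__ == "__main__":
    f, b, noise = 1200.0, 0.3, 0.25  # assumed: 1200 px focal length, 30 cm baseline, 0.25 px matching noise
    for z in (10, 50, 100, 150):     # metres
        print(f"at {z:>3} m: +/-{depth_error(f, b, z, noise):.1f} m range uncertainty")
```

Under these assumed numbers the uncertainty is centimetres at 10 m but grows to several metres beyond 100 m, whereas radar and LiDAR measure range directly.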
Environmental adaptability
Pure vision-based solutions rely on massive, diverse image data to adapt to different environments, yet real-world road environments vary enormously, so vision systems face extremely complex adaptation challenges if they are to meet driving-experience and safety requirements. Traffic environments also differ across countries and affect recognition performance. China's highways are winding and complex, its road networks are dense with interchanges and roundabouts, and pedestrian and electric two-wheeler behavior in its cities differs markedly from the West. Some statistics suggest that handling intersections autonomously in the US is nearly ten times simpler than in China, making the implementation of pure vision solutions in China even more challenging.
Because pure vision systems rely solely on real-time camera perception, lacking beyond-line-of-sight prior information and the assistance of high-precision maps, their "field of vision" is effectively limited to what the cameras can see directly. When Tesla's Full Self-Driving (FSD) system was first rolled out in China earlier this year, many bloggers who tested it found significant "acclimatization issues": without training on local Chinese data, the system struggled to drive smoothly. In contrast, manufacturers drawing on prior information from LiDAR, high-precision maps, and positioning systems handled complex road conditions better. In short, environmental differences limit the generalization ability of pure vision systems; they may misjudge road markings, traffic signs, or driving habits not covered in the training data.
Insufficient system robustness
Highly reliable autonomous driving requires multiple redundancies and fault tolerance. Pure vision systems, relying solely on cameras, inherently lack the complementarity and redundancy of other sensors. If a camera is damaged or occluded (for instance, a lens obscured by raindrops or dirt) or misreads the scene (halos, glare, and so on), the entire perception chain collapses and the system has no backup data source with which to correct the error. This is also why "phantom braking" is so hard to eradicate in pure vision systems: because the speed and acceleration of the vehicle ahead cannot be measured directly, the car may brake hard without warning to avoid a collision it has wrongly predicted. According to the China Securities Journal, the National Highway Traffic Safety Administration (NHTSA) stated in a regulatory filing that after Tesla removed its millimeter-wave radar in 2021, phantom-braking complaints surged from 354 to 758 within a month, triggering a large-scale investigation by US regulators.
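The hypothetical check below illustrates the redundancy argument: a camera-only stack must act on its own, possibly erroneous, range estimate, whereas a fused stack can require corroboration from an independent sensor before commanding a hard stop. The thresholds and the agreement rule are illustrative assumptions, not any vendor's actual logic.

```python
# Hypothetical sketch of why an independent second sensor helps suppress
# "phantom braking": an emergency-brake request is only honoured when the
# vision-based estimate is corroborated (here by a radar track).
# Thresholds and the corroboration rule are illustrative only.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Track:
    range_m: float      # distance to object ahead
    closing_mps: float  # positive = approaching

def time_to_collision(track: Track) -> float:
    return float("inf") if track.closing_mps <= 0 else track.range_m / track.closing_mps

def should_emergency_brake(vision: Optional[Track], radar: Optional[Track],
                           ttc_threshold_s: float = 1.5) -> bool:
    vision_alarm = vision is not None and time_to_collision(vision) < ttc_threshold_s
    radar_alarm = radar is not None and time_to_collision(radar) < ttc_threshold_s
    # Camera-only stack: nothing to cross-check, so one misread (e.g. an
    # overpass shadow) can trigger a hard stop.
    if radar is None:
        return vision_alarm
    # Fused stack: require agreement before a full emergency brake.
    return vision_alarm and radar_alarm

# Vision wrongly sees an obstacle 20 m ahead closing at 25 m/s; radar sees
# nothing hazardous -> the fused check rejects the phantom event.
print(should_emergency_brake(Track(20.0, 25.0), Track(120.0, 0.0)))  # False
print(should_emergency_brake(Track(20.0, 25.0), None))               # True (camera-only)
```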
The safety design of a pure vision system also lacks the functional-safety guarantees that multi-sensor architectures provide. Meeting the safety requirements of high-level autonomous driving means guarding against risks arising from the failure of any single subsystem, a requirement pure vision solutions struggle to satisfy. Tesla's Autopilot, positioned as Level 2 driver assistance, requires the driver to monitor the driving environment at all times; even so, NHTSA has identified hundreds of Autopilot-related accidents, raising questions about its safety. Lacking the redundancy of multi-sensor systems, pure vision-based autonomous driving therefore has significant shortcomings in fault tolerance and robustness, and its safety falls short of multi-sensor fusion solutions.
Model generalization ability and the long tail problem
The perception capabilities of purely visual solutions rely primarily on deep learning models, and model performance depends heavily on the coverage of the training data. Because real-world driving scenarios are extremely diverse, it is difficult for conventional training to encompass all of them. "Long-tail" scenarios that occur infrequently in the dataset (such as rare traffic signs, unconventional obstacles, and unexpected accidents) are often not adequately learned, so the model may mispredict in these situations and fail to react correctly. Addressing the long-tail problem requires large-scale data collection, data augmentation, and simulation to expand the training samples, but even these methods cannot guarantee coverage of every extreme case.
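As a minimal illustration of one common mitigation, the sketch below computes inverse-frequency sampling weights so that rare classes are drawn more often during training. The class names and counts are invented for the example; real pipelines combine such reweighting with targeted data collection, augmentation, and simulation.

```python
# Illustrative sketch of a common long-tail mitigation: inverse-frequency
# sampling weights so rare classes (e.g. unusual obstacles) are seen more
# often during training. Class names and counts are made up.

from collections import Counter

def sampling_weights(labels: list[str], smoothing: float = 1.0) -> dict[str, float]:
    """Weight each class roughly by 1 / frequency; smoothing avoids extreme ratios."""
    counts = Counter(labels)
    total = sum(counts.values())
    raw = {cls: total / (n + smoothing) for cls, n in counts.items()}
    norm = sum(raw.values())
    return {cls: w / norm for cls, w in raw.items()}

labels = ["car"] * 9000 + ["pedestrian"] * 800 + ["overturned_truck"] * 3
for cls, w in sampling_weights(labels).items():
    print(f"{cls:>18}: sampling weight {w:.3f}")
```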
Differences between foreign training data and local deployment can also lead to insufficient generalization. Tesla's FSD, for example, is trained primarily on North American road conditions, which transfer poorly to China's complex road environment. China also imposes strict regulations on autonomous-driving data security, so data Tesla collects in China is difficult to transfer abroad, further limiting localized model training. In short, pure vision systems need massive amounts of high-quality, diverse training data to improve generalization, but acquiring and labeling such data is expensive and time-consuming, making it hard to quickly compensate for the model's shortcomings in new environments.
Future trends and technological evolution
While pure vision solutions have unique advantages in cost and algorithmic innovation, the industry generally believes that true large-scale deployment still requires sensor fusion combined with more advanced AI. No single sensor can cover every scenario, so achieving highly reliable environmental perception in the short to medium term inevitably relies on fusing multiple sensors. This is especially true for Level 4 autonomous driving, where, from a safety perspective, neither LiDAR nor cameras can be dispensed with.
From the forefront of intelligent driving technology, the likely path forward is to keep developing end-to-end large models and optimizing visual algorithms while retaining auxiliary sensors such as millimeter-wave radar or LiDAR to balance accuracy and robustness. Tesla's latest FSD v12.5.1, for example, reportedly introduces an end-to-end neural network architecture and significantly refactors the underlying code in an attempt to further improve the decision-making performance of a pure vision system. Meanwhile, traditional autonomous driving companies and the supply chain are increasing investment in low-cost solid-state LiDAR, radar, and high-precision maps to add diverse safety redundancies to in-vehicle perception.
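As a toy illustration of why adding even one direct-ranging sensor helps, the sketch below fuses two independent range estimates by inverse-variance weighting. The camera and radar variances are assumed values, and production stacks use far richer track-level filtering, but the basic property holds: the fused uncertainty is always lower than either input's.

```python
# Minimal sketch of inverse-variance (late) fusion of two independent range
# estimates, e.g. a camera-derived and a radar-derived distance. Variances are
# assumed values; real systems use Kalman/track-level fusion, this only shows
# why combining sensors reduces uncertainty compared with either alone.

def fuse(est_a: float, var_a: float, est_b: float, var_b: float) -> tuple[float, float]:
    """Combine two noisy measurements of the same quantity.
    The fused variance is always <= the smaller input variance."""
    w_a = var_b / (var_a + var_b)
    w_b = var_a / (var_a + var_b)
    fused = w_a * est_a + w_b * est_b
    fused_var = (var_a * var_b) / (var_a + var_b)
    return fused, fused_var

camera_range, camera_var = 62.0, 16.0  # camera: large range uncertainty at distance
radar_range, radar_var = 58.5, 1.0     # radar: direct, low-variance range measurement
r, v = fuse(camera_range, camera_var, radar_range, radar_var)
print(f"fused range {r:.1f} m, variance {v:.2f} (camera alone: {camera_var}, radar alone: {radar_var})")
```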
In summary, vision-based autonomous driving has advantages in terms of cost and market potential, but it places extremely high demands on the algorithmic capabilities and data support of the perception system. Real-world examples show that camera-based solutions still have reliability vulnerabilities and require careful evaluation and reinforcement. Future development may be more balanced, leveraging advancements in artificial intelligence and visual algorithms while also utilizing multi-sensor fusion to ensure safety in complex environments.