
How can point cloud data from autonomous-driving LiDAR be used for obstacle recognition?

2026-04-06 03:31:38 · #1

Obstacle recognition method based on point cloud

LiDAR provides a 3D point cloud: a set of discrete points with coordinates (usually x, y, z, sometimes with additional channels such as intensity and echo number). Each point is the range measurement of a laser pulse reflected back to the sensor from a surface. The biggest advantage of point clouds is that they directly and accurately measure 3D geometry, which helps determine the shape and distance of objects. However, a point cloud is not a neat pixel grid but a sparse, irregular set of points whose density varies significantly with viewing angle and distance.

Obstacle recognition using point clouds can be divided into several stages: preprocessing (noise filtering, cropping, downsampling, etc.), ground/background separation, point cloud segmentation and clustering, fitting bounding boxes for each cluster and classifying them, post-processing, and tracking. Many technical solutions use classical geometric/statistical methods for fast segmentation and clustering, and then feed the obtained candidates into a learner for classification; in recent years, end-to-end learning-based 3D point cloud object detection has become increasingly mainstream.

Preprocessing, as the first step in obstacle recognition, may look like mere housekeeping of the point cloud data, but it is actually crucial. A single LiDAR scan contains noisy points (caused by weak reflections or multipath propagation), regions of no interest (such as the ego vehicle's roof or distant buildings), and a highly uneven range distribution (dense points near the sensor, sparse points far away). Common preprocessing techniques include box/view clipping (limiting the x, y, z range to the area of interest), statistical outlier removal (e.g., computing each point's average distance to its neighbors and removing significant outliers), and voxel-grid or uniform downsampling to reduce the number of points while maintaining processing speed. Excessive downsampling must be avoided, as it can make distant or small targets disappear, especially in high-speed scenarios where distant targets already contain few points. Therefore, voxel sizes should be tuned to the sensor's point density and the requirements of the downstream algorithm, or a distance-dependent downsampling strategy should be adopted (coarser voxels nearby where points are dense, finer or no downsampling far away). This stage often determines whether the system can maintain high detection recall under real-time constraints.
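As a concrete sketch of two of the preprocessing steps mentioned above, the following numpy code implements voxel-grid downsampling and statistical outlier removal. It is a minimal illustration with placeholder parameters and a brute-force neighbor search; a real pipeline would use PCL or at least a KD-tree for the neighbor queries.

```python
import numpy as np

def voxel_downsample(points, voxel_size):
    """Keep one representative point (the centroid) per occupied voxel."""
    coords = np.floor(points / voxel_size).astype(np.int64)
    # Group points by voxel index and average each group.
    _, inverse = np.unique(coords, axis=0, return_inverse=True)
    n_voxels = inverse.max() + 1
    sums = np.zeros((n_voxels, 3))
    counts = np.zeros(n_voxels)
    np.add.at(sums, inverse, points)
    np.add.at(counts, inverse, 1)
    return sums / counts[:, None]

def remove_statistical_outliers(points, k=8, std_ratio=2.0):
    """Drop points whose mean distance to their k nearest neighbors is
    far above the global average (brute force; use a KD-tree in practice)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, 1:k + 1]      # column 0 is the self-distance
    mean_d = knn.mean(axis=1)
    keep = mean_d < mean_d.mean() + std_ratio * mean_d.std()
    return points[keep]
```

Both `voxel_size` and the outlier thresholds are exactly the parameters the paragraph above warns about: too coarse a voxel erases distant targets, and too tight an outlier threshold deletes sparse but real returns.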

Ground separation, in short, removes "ground/road surface" points to highlight actual obstacles (vehicles, pedestrians, pillars, etc.). Common methods include model-based RANSAC plane fitting, ground extraction based on grid/height maps, and ring-by-ring ground classification. RANSAC is attractive for its conceptual simplicity and robustness to noise, but it tends to fail on complex road surfaces (slopes, undulating roadbeds, heavy obstacle coverage) or sparse point clouds, requiring careful tuning of the inlier threshold and iteration count. Height-map methods are better suited to mobile platforms: they project the point cloud onto a horizontal grid, identify the lowest point in each cell, and apply terrain filtering (e.g., morphological filtering or slope thresholding). This works better on uneven road surfaces but requires careful selection of the grid resolution to balance detail and noise.
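The RANSAC variant described above can be sketched in a few lines of numpy. The inlier threshold and iteration count below are illustrative placeholders; a real system would additionally constrain the plane normal to be near-vertical so that walls are not selected as "ground".

```python
import numpy as np

def ransac_ground(points, n_iter=100, dist_thresh=0.15, seed=0):
    """Fit the dominant plane by RANSAC and return a boolean ground mask."""
    rng = np.random.RandomState(seed)
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(n_iter):
        # Sample 3 points and form the plane through them.
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        normal /= norm
        # Points within dist_thresh of the plane are inliers.
        mask = np.abs((points - p0) @ normal) < dist_thresh
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return best_mask
```

A `dist_thresh` of roughly 0.1–0.2 m is a common starting point; on sloped or undulating roads a single global plane breaks down, which is exactly where the grid/height-map methods above take over.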

After removing the ground points, the next step is to group the remaining points into clusters based on "connectivity" or "density," with each cluster typically corresponding to a candidate object. This is commonly done with Euclidean cluster extraction or density-based clustering (DBSCAN). Euclidean clustering searches for neighbors within a given radius of each point and groups mutually reachable points into one cluster. It is fast and simple to implement, and is the standard implementation in PCL. However, it is sensitive to its parameters (neighborhood radius and minimum cluster size) and tends to merge objects that are close together or embedded in a high-density background. DBSCAN handles non-spherical clusters and noise better, but typically costs more computation, and its parameter selection is not easily automated. In practice, scanline-based segmentation (detecting breakpoints in each horizontal ring of the LiDAR, well suited to road scenarios) is often combined with Euclidean clustering to balance speed and separation capability. PCL provides mature modules for this that can be called directly.
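The Euclidean clustering idea can be shown with a naive O(N²) breadth-first search over the "within radius" neighbor graph. PCL's implementation accelerates the neighbor queries with a KD-tree; the radius and minimum cluster size below are the placeholder parameters the paragraph warns about.

```python
import numpy as np
from collections import deque

def euclidean_cluster(points, radius=1.0, min_points=5):
    """Naive Euclidean cluster extraction: BFS over the neighbor graph.
    O(N^2) pairwise distances; PCL uses a KD-tree instead."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    visited = np.zeros(n, dtype=bool)
    clusters = []
    for seed in range(n):
        if visited[seed]:
            continue
        visited[seed] = True
        queue, members = deque([seed]), []
        while queue:
            i = queue.popleft()
            members.append(i)
            for j in np.where((d[i] < radius) & ~visited)[0]:
                visited[j] = True
                queue.append(int(j))
        if len(members) >= min_points:
            clusters.append(np.array(members))
    return clusters
```

Isolated returns form clusters smaller than `min_points` and are discarded as noise, while a too-large `radius` is what merges two nearby vehicles into one cluster.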

For each cluster, a 3D bounding box is fitted and the cluster is classified. The bounding box can be axis-aligned (AABB) or oriented; the latter is more practical in road scenarios because vehicles are typically aligned with a particular direction. Least squares or PCA (principal component analysis) can quickly estimate the dominant orientation of the cluster and generate the oriented bounding box. Traditional methods extract several geometric features per cluster, such as size, point density, shape histograms (e.g., features based on surface normals), and height distribution, and then use a simple classifier (SVM, random forest) to decide whether it is a car, a person, or another obstacle. This approach still has advantages when samples are few and computational resources are limited, and it offers good interpretability (you can see which features drove a misclassification). However, hand-designed features often struggle to cover complex real-world variation such as occlusion, deformation, or point cloud fragmentation caused by glass and gaps in vehicle bodies.
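A yaw-only oriented box via PCA on the cluster's x-y footprint might look like the following sketch. It assumes an upright object; note that PCA leaves the heading ambiguous by 180 degrees, which tracking or a classifier must resolve.

```python
import numpy as np

def oriented_bbox_yaw(cluster):
    """PCA-based yaw-only oriented box: returns center, (l, w, h), yaw."""
    xy = cluster[:, :2]
    center_xy = xy.mean(axis=0)
    # Principal axes of the 2-D covariance give the dominant direction.
    eigvals, eigvecs = np.linalg.eigh(np.cov((xy - center_xy).T))
    major = eigvecs[:, np.argmax(eigvals)]
    yaw = np.arctan2(major[1], major[0])
    # Rotate by -yaw into the box frame and measure the extents there.
    c, s = np.cos(yaw), np.sin(yaw)
    local = (xy - center_xy) @ np.array([[c, s], [-s, c]]).T
    l, w = np.ptp(local[:, 0]), np.ptp(local[:, 1])
    h = np.ptp(cluster[:, 2])
    center = np.array([center_xy[0], center_xy[1],
                       (cluster[:, 2].min() + cluster[:, 2].max()) / 2.0])
    return center, (l, w, h), yaw
```

Because PCA follows the spread of the points, heavy occlusion (e.g., only one visible side of a car) can bias the estimated orientation; L-shape fitting is a common refinement in that case.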

Technical challenges and solutions

In recent years, deep learning methods that operate directly on point clouds have developed rapidly, bringing higher accuracy and less manual feature engineering, but also introducing heavy demands for labeled data and compute. Point cloud deep learning methods fall broadly into three families: point-based methods, voxel-based methods, and pillar/bird's-eye-view (BEV) methods. PointNet is the pioneering point-based work. It accepts the raw point set as input and obtains a global descriptor that is invariant to point ordering by applying a shared MLP (the same multilayer perceptron to each point) followed by global max pooling. This lets the network process irregular point sets directly and perform classification/segmentation tasks. PointNet's ideas were later extended in PointNet++ to capture local structure, and both have become important cornerstones of point cloud processing.
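The core PointNet mechanism, a shared per-point MLP followed by a symmetric max pool, can be demonstrated in a few lines of numpy. The weights here are random placeholders rather than a trained model; the point of the demo is that the output is invariant to the ordering of the input points.

```python
import numpy as np

def pointnet_global_feature(points, weights, biases):
    """Shared per-point MLP (same weights for every point) followed by a
    symmetric max pool; the output is invariant to point ordering."""
    x = points                                  # (N, 3)
    for W, b in zip(weights, biases):
        x = np.maximum(x @ W + b, 0.0)          # per-point linear + ReLU
    return x.max(axis=0)                        # (D,) global descriptor

# Random placeholder weights for a 3 -> 64 -> 128 shared MLP.
rng = np.random.RandomState(0)
weights = [rng.randn(3, 64) * 0.1, rng.randn(64, 128) * 0.1]
biases = [np.zeros(64), np.zeros(128)]
```

Shuffling the rows of `points` leaves the pooled feature unchanged, which is exactly why PointNet can consume an unordered point set without voxelization.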

However, for the real-time detection task in autonomous driving scenarios, feeding the raw point cloud directly into a point-level network is computationally inefficient, so several approaches first convert the point cloud into a regular tensor. VoxelNet divides the point cloud into 3D voxels; within each voxel, a small network (the voxel feature encoder, VFE) encodes the set of points into a fixed-dimensional representation, and the voxelized features are then passed to a convolutional detection network. This balances point-level detail with the parallelism of convolutional networks. SECOND builds on VoxelNet with sparse convolution, which greatly accelerates convolution over sparse voxels and achieves a better speed/accuracy balance among voxel methods. PointPillars instead projects the point cloud into vertical "pillars," generating a feature map for each cell of a 2D grid, and then applies conventional 2D convolutions. This design turns a 3D problem into a 2D convolution problem, greatly improving speed while performing well on benchmarks, and has become a widely adopted compromise. These end-to-end deep detection methods have driven significant performance gains on public benchmarks (such as KITTI), but they also rely on large amounts of labeled point clouds and demand more inference compute.
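To make the pillar/BEV idea concrete, here is a sketch that scatters points into vertical columns on an x-y grid and computes fixed per-pillar statistics (occupancy count, mean height, max height). PointPillars learns this per-pillar encoding with a small PointNet-style network instead of fixed statistics; the grid extents and cell size below are placeholders.

```python
import numpy as np

def pillarize(points, x_range, y_range, cell_size):
    """Scatter points into vertical pillars on an x-y grid and compute
    per-pillar statistics: occupancy count, mean height, max height."""
    nx = int(round((x_range[1] - x_range[0]) / cell_size))
    ny = int(round((y_range[1] - y_range[0]) / cell_size))
    ix = np.floor((points[:, 0] - x_range[0]) / cell_size).astype(int)
    iy = np.floor((points[:, 1] - y_range[0]) / cell_size).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    flat = ix[valid] * ny + iy[valid]          # flattened cell index
    z = points[valid, 2]
    count = np.bincount(flat, minlength=nx * ny).astype(float)
    zsum = np.bincount(flat, weights=z, minlength=nx * ny)
    mean_z = np.divide(zsum, count, out=np.zeros(nx * ny), where=count > 0)
    zmax = np.full(nx * ny, -np.inf)
    np.maximum.at(zmax, flat, z)
    zmax[count == 0] = 0.0
    # Result is a (3, nx, ny) BEV "image" ready for 2-D convolutions.
    return np.stack([count, mean_z, zmax]).reshape(3, nx, ny)
```

The output is an ordinary multi-channel image, which is exactly what lets pillar methods reuse the mature 2D convolution ecosystem.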

A series of engineering problems around model training and deployment also need attention. First, data annotation and augmentation: point cloud annotation is costly (it requires precise 3D bounding boxes), and different LiDAR models have varying point densities and fields of view, so trained models transfer poorly across sensors or scenes. Common remedies include simulated data augmentation, projecting point clouds onto a BEV (bird's-eye view) or combining camera images for multimodal training, and noise-style augmentation such as scaling, rotation, and point dropping applied to training samples. Second, class imbalance: road scenes contain far more vehicles than pedestrians or cyclists, which biases training toward the majority class; this is mitigated through sampling, loss weighting, or online hard-negative mining. Third, real-time constraints: on automotive-grade platforms with limited compute, accuracy must be traded against latency. Common strategies include lightweight point cloud preprocessing that quickly generates high-recall candidates followed by a heavier second-stage classifier, or distilling and quantizing the detection model to reduce latency.
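As one concrete mitigation for class imbalance, per-class loss weighting can be sketched as inverse-frequency weights applied to softmax cross-entropy. The class counts below are made-up illustrative numbers, not from any real dataset.

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_counts):
    """Softmax cross-entropy with inverse-frequency class weights, so rare
    classes (pedestrians, cyclists) are not drowned out by vehicles."""
    # Inverse-frequency weights, normalized so their mean is 1.
    w = class_counts.sum() / (len(class_counts) * class_counts)
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    per_sample = -log_p[np.arange(len(labels)), labels]
    return float((w[labels] * per_sample).mean())
```

With counts of, say, 9000 cars to 200 cyclists, a misclassified cyclist contributes far more loss than a misclassified car, pushing the model to attend to the rare class.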

The impact of environmental and weather conditions on LiDAR must be carefully considered. Rain, snow, and fog reduce echo intensity and introduce spurious speckle returns, increasing point cloud noise and losing distant targets. LiDAR is also affected by surface materials (highly reflective or light-absorbing surfaces) and glass, causing holes or false reflections in the point cloud. In these situations simple geometric thresholds often become unreliable, making it necessary to use temporal smoothing and fusion across frames, motion compensation (because the platform of a rotating LiDAR moves during a scan, the sweep needs distortion correction), and sensor-level fusion with cameras or millimeter-wave radar. Motion compensation typically relies on IMU/odometry information to transform each point, using its per-point timestamp, back to a common reference time, reducing the deformation caused by vehicle motion; this is especially important at high speed or with rotating scanners (e.g., the Velodyne VLP series).
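A minimal motion-compensation (deskew) sketch under the simplifying assumption of constant ego velocity over one sweep: each point is shifted by the ego displacement between its per-point timestamp and the chosen reference time. Real systems interpolate full 6-DoF poses from IMU/odometry rather than assuming a constant linear velocity.

```python
import numpy as np

def deskew_constant_velocity(points, timestamps, v_ego, t_ref):
    """Shift each point by the ego displacement between its capture time
    and the reference time (constant-velocity assumption, no rotation)."""
    dt = (timestamps - t_ref)[:, None]         # seconds relative to t_ref
    return points + dt * v_ego[None, :]
```

For a 10 Hz rotating scanner, points at the start and end of a sweep are 100 ms apart; at highway speed that is several meters of ego motion, which is why skipping this step smears or doubles static objects.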

Beyond detection, tracking is a crucial step in keeping the system's perception stable. Associating detections frame by frame and maintaining target IDs and trajectories, using Kalman filtering (standard, extended, or unscented), the Hungarian algorithm for inter-frame matching, or trajectory-based ReID features for association, can significantly reduce the momentary decision risks caused by false or missed detections. The accuracy of velocity estimation and trajectory prediction directly affects the planning and decision modules, so many solutions deploy tracking as a necessary but lightweight module after detection.
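A per-track constant-velocity Kalman filter, the simplest of the variants mentioned, might look like the sketch below (2D state [x, y, vx, vy]; the noise covariances are placeholders to tune per sensor). Each track runs `predict` before association and `update` with its matched detection.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal 2-D constant-velocity Kalman filter for one track."""
    def __init__(self, x0, y0, dt=0.1):
        self.x = np.array([x0, y0, 0.0, 0.0])   # state: [x, y, vx, vy]
        self.P = np.eye(4) * 10.0               # large initial uncertainty
        self.F = np.eye(4)                      # constant-velocity motion
        self.F[0, 2] = self.F[1, 3] = dt
        self.H = np.zeros((2, 4))               # we only measure position
        self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = np.eye(4) * 0.01               # process noise (placeholder)
        self.R = np.eye(2) * 0.1                # measurement noise (placeholder)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                       # predicted position for gating

    def update(self, z):
        y = z - self.H @ self.x                 # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

The predicted positions feed a cost matrix (e.g., Euclidean or Mahalanobis distance to detections) that the Hungarian algorithm solves for the frame-to-frame assignment; `scipy.optimize.linear_sum_assignment` is a common implementation.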

How do we evaluate detection performance?

Common metrics include mean average precision (mAP) computed from intersection-over-union (IoU), recall and false positive rates at different distances, and latency (the time from point cloud to detection result). Recognized public benchmarks such as KITTI provide standardized test sets and evaluation protocols, and many algorithms report KITTI 3D detection scores for comparison. However, note that KITTI's sampling and annotation are biased (e.g., toward vehicles and urban roads); model performance may differ in other scenarios (elevated roads, highways, rural areas), so engineering validation must cover the target deployment environment.
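IoU for the axis-aligned BEV case shows the mechanics behind these metrics; note that KITTI's official evaluation uses rotated-box and full 3D IoU, which requires polygon intersection rather than this simple overlap.

```python
import numpy as np

def iou_aabb_bev(box_a, box_b):
    """Axis-aligned BEV IoU between boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

A detection counts as a true positive when its IoU with a ground-truth box exceeds a class-dependent threshold (KITTI uses 0.7 for cars and 0.5 for pedestrians/cyclists); mAP then averages precision over recall levels.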

In engineering practice, combined strategies are often more reliable. For example, one can first use fast classical methods to generate high-recall candidates (e.g., voxel grid + Euclidean clustering to obtain a batch of candidate clusters), then fuse these candidates with the camera's 2D detections or a deep model's classification head for the final judgment; or, when compute allows, directly deploy an end-to-end deep detection model (e.g., PointPillars, SECOND) combined with lightweight post-processing (non-maximum suppression, tracking) to ensure stable output. If the project targets mass production or automotive-grade applications, also consider model compression, fixed-point conversion, and inference benchmarking on different hardware (CPU, GPU, DSP, NPU).

Let's discuss some details and common pitfalls. First, the ground separation threshold and grid resolution are two parameters most frequently mistuned. Improper settings can lead to misclassifying low obstacles as ground or ground fragments as obstacles. Second, the cluster radius and minimum cluster size directly determine whether adjacent vehicles or small targets like pedestrians can be distinguished. Third, neglecting point cloud distortion compensation can cause consecutive frame matching failures during turns or uneven speeds. Fourth, the intensity channel is often underestimated, but it is actually very useful for distinguishing glass/reflective objects from real entities and should be utilized in features or network input. Fifth, distant targets often have very few points, which deep learning models easily overlook. In this case, special hard example augmentation is needed during training, or different processing channels are used for near and far objects at the perception system level to improve far-range recall.

If you're performing point cloud detection within a team, you need to build a diverse annotation set that covers different sensors, different line-of-sight distances, day and night conditions, and various weather conditions such as rain and snow. Utilize semi-automatic annotation tools (such as batch annotation based on trajectory propagation) to reduce manual labor costs. Data augmentation is also crucial. Random rotation (around the vertical axis), scaling, random point deletion (simulating point loss), and inserting labeled objects into different scenes (stitching augmentation) are common techniques that can significantly improve the model's robustness.
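The augmentations listed above (random yaw rotation, global scaling, random point dropping) can be sketched as a single function; the scale range and drop probability are typical but illustrative values.

```python
import numpy as np

def augment_cloud(points, rng, max_scale=0.05, drop_prob=0.1):
    """Random yaw rotation + global scaling + random point dropping."""
    yaw = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # about z
    scale = rng.uniform(1.0 - max_scale, 1.0 + max_scale)
    out = (points @ R.T) * scale
    keep = rng.rand(len(out)) > drop_prob       # simulate point loss
    return out[keep]
```

When these transforms are applied to a training sample, the 3D box labels must be rotated and scaled by the same parameters, or the supervision becomes inconsistent.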

Looking further at engineering trade-offs in deep learning: point-level networks (PointNet/PointNet++) are sensitive to small targets and fine detail, but carry high computational cost in large-scale scenes. Voxel/sparse-convolution methods are easier to optimize for speed (sparse convolution operates efficiently on sparse voxels) and integrate well with existing convolution acceleration libraries. Pillar/BEV methods are the easiest to deploy, since the BEV representation directly leverages the mature 2D convolutional network ecosystem and aligns naturally with camera BEV features or high-definition map data. In actual products, these methods are often arranged in a multi-stage architecture: a lightweight fast detector for proposals, a more rigorous detector for confirmation, and a tracking module in series to ensure stable output.

Future development direction

Looking ahead, several directions warrant attention. Sensor fusion will become increasingly prevalent, with cameras providing rich semantic and color information that greatly aids category recognition; millimeter-wave radar is more robust to adverse weather, compensating for the shortcomings of LiDAR in rain and snow. On the model side, semi-supervised learning, weakly supervised learning, and simulation data (synthetic point clouds) will alleviate annotation bottlenecks; at the same time, Transformer and large-model approaches are expanding into the point cloud domain, attempting to integrate multimodal information with more general representations. In engineering, robustness (stability across weather and sensor variations) will be prioritized over a single benchmark score.

Finally, some practical suggestions. If you are starting out with point cloud obstacle recognition, first implement a stable, classic pipeline: point cloud denoising → ground separation (grid-based or RANSAC) → Euclidean clustering → geometric feature classification → tracking. This yields stable, usable detection results in a very short time and eases integration with planning/control. In parallel, develop deep models based on PointPillars or VoxelNet and treat them as a second-stage enhancement. On the data side, prioritize building multi-scene annotation sets and robust data augmentation strategies. During deployment, emphasize latency testing and model compression (quantization, distillation). For complex environments (tunnels, snow, highways), incorporate multi-sensor fusion and temporal information to improve robustness.
