Aaron Ruof
Developing descriptors for LiDAR points using contrastive learning
Duration: 6 months
Completion: August 2024
Supervisor: M.Sc. David Skuddis
Examiner: Prof. Dr.-Ing. Norbert Haala
Introduction
Many methods for generating descriptors for 3D points require a dense point cloud. Handheld LiDAR scanners in particular, however, produce sparse point clouds for which many established methods no longer yield good results (Serafin, 2016).
To obtain descriptors that capture the geometric properties of points in such sparse LiDAR point clouds, new methods are needed that are designed specifically for this kind of data. Since few such methods exist, a new one is developed here. It uses a neural network trained with a self-supervised learning approach, which avoids the manual labeling of large amounts of data. Contrastive learning with a Siamese network (Bromley, 1993) proves suitable for this goal: with this approach, the method can be trained using only SLAM datasets such as the Cloister recording of the Newer College dataset.
Methodology
Before the network can be trained, the data must be prepared once in a pre-processing step. First, the point clouds are projected into range images, since the neural network operates on images (Milioto, 2019). Second, contrastive learning requires a relationship between the scans. Here, identical points are searched for; they can be determined using the trajectory from the ground-truth file. To do this, all scans are transformed into one global point cloud, which is then divided into voxels with an edge length of 5 cm. All points within the same voxel are considered identical. The information about which point lies in which voxel is saved and used during training.
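The voxel-based correspondence search described above can be sketched as follows (a minimal NumPy illustration, not the thesis implementation; the function name and the one-point-per-voxel simplification are assumptions for brevity):

```python
import numpy as np

def voxel_correspondences(points_a, points_b, voxel_size=0.05):
    """Treat points of two scans (already transformed into the global
    frame) as identical when they fall into the same 5 cm voxel.
    Returns index pairs (i, j) into points_a and points_b.
    Note: keeps only one point per voxel and per scan for brevity."""
    def voxel_keys(points):
        return {tuple(k): i for i, k in
                enumerate(np.floor(points / voxel_size).astype(np.int64))}

    keys_a, keys_b = voxel_keys(points_a), voxel_keys(points_b)
    return [(ia, keys_b[v]) for v, ia in keys_a.items() if v in keys_b]

a = np.array([[0.01, 0.01, 0.01], [1.0, 1.0, 1.0]])
b = np.array([[0.02, 0.03, 0.04], [5.0, 5.0, 5.0]])
print(voxel_correspondences(a, b))  # → [(0, 0)]
```

In practice the voxel indices would be precomputed once over the whole global point cloud and stored, so that the pairs are available without recomputation during training.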
During training of the neural network, two range images are drawn and compared. Once all identical points between them have been identified, the range images are processed by the network. The objective of the loss function is to minimize the distance between the descriptors of identical points while simultaneously maximizing the variance of all descriptors within an image.
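The two loss terms can be sketched as follows (a minimal NumPy illustration, assuming a squared Euclidean distance for the pair term and the reciprocal per-dimension variance mentioned in the conclusion for the diversity term; the actual network output and weighting may differ):

```python
import numpy as np

def contrastive_loss(desc_a, desc_b, all_desc, w=1.0):
    """desc_a, desc_b: (N, D) descriptors of identical point pairs;
    all_desc: (M, D) all descriptors of one image.
    Pulls paired descriptors together and pushes the overall variance
    up by penalizing its reciprocal (the term questioned later)."""
    pair_term = np.mean(np.sum((desc_a - desc_b) ** 2, axis=1))
    var_term = np.mean(1.0 / (np.var(all_desc, axis=0) + 1e-8))
    return pair_term + w * var_term
```

With identical paired descriptors the pair term vanishes, so the loss is driven purely by how spread out the descriptors of the image are.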
To obtain the best results, multiple experiments are carried out to find the best configuration for the procedure.
Visualizations
To facilitate the evaluation of the results, several visualization methods have been developed. These should allow for the evaluation of the quality of the results and the identification of parameters that require adjustment in order to achieve the optimal outcome during the experiments.
In one illustration, an image is partitioned into clusters to show where points with similar descriptors are located. Many clusters are expected in geometrically salient regions. This is observed at some windows, but not at the statue. This indicates that the result is not optimal and that the descriptors are highly similar even in geometrically salient regions.
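A cluster image of this kind can be produced, for example, by running k-means over the per-pixel descriptors and color-coding the resulting labels (an illustrative sketch with a tiny hand-rolled k-means; the thesis visualization may use a different clustering method):

```python
import numpy as np

def cluster_descriptors(desc_img, k=4, iters=10, seed=0):
    """desc_img: (H, W, D) per-pixel descriptors. Assigns each pixel to
    one of k clusters via a small k-means, yielding a (H, W) label image
    that can be color-coded for inspection."""
    h, w, d = desc_img.shape
    x = desc_img.reshape(-1, d)
    rng = np.random.default_rng(seed)
    centres = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dist = np.linalg.norm(x[:, None] - centres[None], axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # skip empty clusters
                centres[c] = x[labels == c].mean(axis=0)
    return labels.reshape(h, w)
```

If the descriptors are informative, the label image should show many small clusters at salient structures; large uniform regions instead point to near-identical descriptors.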
An alternative approach is to analyze point correspondences identified between two scans. These are visualized in terms of the distance between the paired descriptors and the distance between the points in the global point cloud. The result likewise shows that correspondences can be found whose descriptors are highly similar, even though the points are several meters apart.
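This diagnostic can be sketched as a nearest-neighbor search in descriptor space followed by measuring the 3D distance of the matched points (an illustrative NumPy sketch; the function name is an assumption):

```python
import numpy as np

def match_and_measure(desc_a, pts_a, desc_b, pts_b):
    """For every descriptor of scan A, find the nearest descriptor in
    scan B and return both the descriptor distance and the 3D distance
    between the matched points (in the global frame). Small descriptor
    distances at large 3D distances expose indistinct descriptors."""
    ddist = np.linalg.norm(desc_a[:, None] - desc_b[None], axis=2)
    nn = ddist.argmin(axis=1)
    desc_dist = ddist[np.arange(len(desc_a)), nn]
    point_dist = np.linalg.norm(pts_a - pts_b[nn], axis=1)
    return desc_dist, point_dist
```

Plotting descriptor distance against point distance then makes mismatches visible as points with a small value on one axis and a large value on the other.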
Conclusion
The visualizations demonstrate that the method does not yet deliver the desired results. Although the approach seems to work in principle, there are still issues that render the outcomes unusable. The descriptors show a high degree of similarity overall. Consequently, applications such as the matching of point clouds yield poor results and are therefore not effective.
To resolve this issue, the loss function must be modified. In particular, the component intended to make the descriptors as diverse as possible does not function well. The current approach reduces the reciprocal of the variance of all descriptors. However, the reciprocal does not appear to be an optimal choice. The expectation is that an adjusted loss function without the reciprocal variance should deliver substantially better results.
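One conceivable adjustment is to replace the reciprocal-variance term with a hinge on the descriptor standard deviation, which penalizes collapse only below a margin instead of producing an unbounded gradient as the variance shrinks (a hedged sketch of one possible alternative, not necessarily the formulation ultimately chosen in the thesis):

```python
import numpy as np

def variance_hinge(all_desc, margin=1.0):
    """Alternative diversity term: penalize the per-dimension standard
    deviation of the descriptors only where it falls below a margin.
    Zero once the descriptors are sufficiently spread out, close to the
    margin when they have collapsed to a single value."""
    std = np.sqrt(np.var(all_desc, axis=0) + 1e-8)
    return np.mean(np.maximum(0.0, margin - std))
```

Unlike the reciprocal, this term is bounded and vanishes entirely once the descriptors are diverse enough, so it no longer competes with the pair term for well-spread descriptors.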
Sources
Bromley, J. et al. (1993). Signature Verification using a "Siamese" Time Delay Neural Network. International Journal of Pattern Recognition and Artificial Intelligence.
Milioto, A. et al. (2019). RangeNet++: Fast and Accurate LiDAR Semantic Segmentation. 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 4213-4220.
Serafin, J. et al. (2016). Fast and Robust 3D Feature Extraction from Sparse Point Clouds. 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 4105-4112. doi:10.1109/IROS.2016.7759604
Contact

Norbert Haala
apl. Prof. Dr.-Ing., Deputy Head of the Institute