Improving True Ortho quality by removing moving vehicles

Master Thesis at ifp - Hannes Nübel


Duration: 6 months
Completion: May 2022
Supervisors: Dr. Patrick Tutzauer (nFrames GmbH), Prof. Dr.-Ing. Norbert Haala
Examiner: Prof. Dr.-Ing. Norbert Haala


 

Introduction

Geometrically accurate representations such as True Orthophotos (TOs) and textured meshes generated from aerial imagery are used in an increasing number of fields. Because of the growing interest in such products, they are optimized on many fronts, such as geometry, texturing, or processing time. An important customer need regarding the TO is an appropriate handling of moving vehicles. Apart from the ability to remove such objects from the resulting product completely, it is especially desirable to prevent artifacts that arise from the blending of varying textures, referred to as ghost cars (Figure 1). This thesis describes an approach to remove moving vehicles from TOs by masking them in the aerial imagery with a Convolutional Neural Network (CNN) that additionally exploits the information from depth images. These masks are then used during the texturing of the TO.

Figure 1: Ghost cars induced by blending the texture of vehicles and street from different aerial images.

Methods

The mask for an aerial image is derived using an adapted and trained version of Mask R-CNN, which performs instance segmentation to derive a semantic mask for each moving vehicle in the corresponding image. In addition to the color images captured by the camera, the depth image is used as a second input to the CNN. While it is possible to check whether certain objects are stationary in a series of images, this information is not contained in a single color image. However, the reconstruction pipeline SURE produces depth images as an intermediate product from which this can be retrieved. A depth image holds the distance of the ground points to the camera for the matched pixels in the base image. These depth images are, however, not completely filled with values. First of all, depth values can only be derived in areas where the used images overlap, so that corresponding pixels in stereo image pairs can be matched. Apart from that and occluded areas, pixels of objects that move between the capture of images also cannot be matched, because the pixels at this location show a different object than the pixels of another image mapping the same region. In the case of moving cars, this means that where one image contains a car at a certain position, a different image will show the surface of the road for the same region. This results in holes in the depth image wherever a moving vehicle is captured. Moving vehicles can therefore be detected for each image by using the color information from the RGB image to identify vehicles and the depth image to tell whether they are stationary (Figure 2). The resulting masks for the moving vehicles derived from the CNN are then used in the TO texturing process to exclude those pixels in the color images and use the color information from different views instead.

Figure 2: RGB image with a transparently overlaid, color-coded depth image (NaN depth values shown as no data), illustrating the difference between moving and stationary vehicles in the depth image.
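How exactly color and depth are fused inside the adapted Mask R-CNN is not spelled out in this summary. The following is therefore only a minimal sketch, assuming NumPy arrays as input: it stacks RGB, depth, and a depth-validity channel into one network input and expresses the depth-hole cue as a simple per-instance score. The names stack_rgb_depth and moving_score, the sentinel value, and the score itself are illustrative assumptions, not the thesis implementation.

    import numpy as np

    def stack_rgb_depth(rgb, depth, nodata_value=-1.0):
        """Stack an RGB aerial image and its depth image into one network input.

        rgb   : (H, W, 3) uint8 color image
        depth : (H, W) float32 depth image from dense matching; pixels that
                could not be matched (occlusions, moving objects) are NaN.
        """
        valid = np.isfinite(depth).astype(np.float32)             # 0 in the "holes"
        depth_filled = np.where(valid > 0, depth, nodata_value)   # no NaNs in the input
        rgb_n = rgb.astype(np.float32) / 255.0
        # Resulting shape (H, W, 5): R, G, B, depth, depth validity.
        return np.dstack([rgb_n, depth_filled, valid])

    def moving_score(depth, instance_mask):
        """Fraction of unmatched (NaN) depth pixels inside a vehicle instance mask.

        A value close to 1 suggests a moving vehicle (its pixels could not be
        matched between images), a value close to 0 a stationary one.
        """
        holes = ~np.isfinite(depth)
        return float(holes[instance_mask].mean())

In the thesis itself the network learns this decision directly from the combined color and depth input; the moving_score heuristic is only meant to make the depth-hole argument from the paragraph above explicit.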

Results

Aerial images with moving vehicle masks

Figure 3 shows that the CNN detects cars reliably and is also able to distinguish between moving and stationary ones. It also reveals a particularly challenging case: vehicles that do not move constantly and/or move only slowly. In this case, some points on the vehicle can still be matched between images, resulting in an inconsistent set of depth values, as for the cars coming from the bottom and turning right.

Figure 3: Comparison of ground truth and prediction (data set C) with moving, stationary, and partly moving vehicles.

Masked True Orthophotos

Figure 4 shows segments of a TO with moving vehicles textured with the standard approach on the left. The masked version, which excludes the corresponding pixels in the aerial images from the texturing of the TO, is shown on the right. It can be observed that moving vehicles are removed cleanly from the TO, while stationary ones (e.g. at traffic lights or in parking spaces) retain their appearance (compare the left and right sides of Figure 4). The approach removes moving vehicles that are displayed completely in the reference, as well as ones that are blended, cut, textured multiple times, or show combinations of such effects, which are especially unpleasant.

Figure 4: TO comparison reference (left) and masked (right).
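The texturing step itself is part of the SURE pipeline and is not described in detail here. As a rough per-pixel sketch of how the masks could be used, assuming that for every TO pixel the candidate view colors, optional blending weights, and moving-vehicle mask flags have already been resampled (blend_ortho_pixel and its fallback behavior are assumptions for illustration):

    import numpy as np

    def blend_ortho_pixel(colors, masked, weights=None):
        """Blend one True Ortho pixel from several aerial views, skipping views
        in which the pixel lies inside a moving-vehicle mask.

        colors  : (N, 3) array, color of the pixel in each candidate view
        masked  : (N,) bool array, True if the pixel is masked in that view
        weights : optional (N,) blending weights (e.g. from viewing geometry)
        """
        colors = np.asarray(colors, dtype=np.float32)
        masked = np.asarray(masked, dtype=bool)
        w = np.ones(len(colors), dtype=np.float32) if weights is None \
            else np.asarray(weights, dtype=np.float32)

        keep = ~masked
        if not keep.any():
            # Every view is masked at this pixel: fall back to the regular blend
            # so that no hole appears in the TO.
            keep = np.ones_like(masked)

        w = w * keep
        return (colors * w[:, None]).sum(axis=0) / w.sum()

In the TO this means that the road surface seen in the remaining views fills the area of the moving vehicle, which is the effect visible on the right of Figure 4.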

Conclusion

The approach introduced in this thesis removes a high percentage of moving vehicles from TOs, creating a clearer view of the road networks and getting rid of many unpleasant effects, such as ghost cars caused by an inconsistent blending of textures from moving vehicles. Since it does not change the actual texturing procedure, the appearance of areas that do not have to be masked stays the same. Areas with masked moving vehicles also mostly look similar, because texturing is performed in the same way, just without considering the specified area from an image. However, this similarity of course depends on the views available for the corresponding area.

Two aspects proved particularly challenging when implementing the approach. One is the need for dense depth maps in order to reliably distinguish moving from stationary vehicles. The other is the handling of vehicles that cannot be labeled unambiguously as moving or stationary in all images in which they are captured. This may lead to vehicles not being removed (i.e. appearing as in the reference TO), although they might cause artifacts. Nevertheless, the majority of moving vehicles are detected and removed, especially in flowing traffic, creating a more appealing end product.

References (selection)

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org

He, Kaiming et al. (2017). “Mask R-CNN”. In: Proceedings of the IEEE international conference on computer vision, pp. 2961–2969.

Kraus, Karl (2011). Photogrammetry: Geometry from Images and Laser Scans. De Gruyter. Chap. 5.3, pp. 271–276. ISBN: 9783110892871. DOI: 10.1515/9783110892871.

Ren, Shaoqing et al. (2015). “Faster R-CNN: Towards real-time object detection with region proposal networks”. In: Advances in Neural Information Processing Systems 28.

Rothermel, Mathias et al. (2012). “SURE: Photogrammetric surface reconstruction from imagery”. In: Proceedings LC3D Workshop, Berlin. Vol. 8. 2.

Contact


Norbert Haala

apl. Prof. Dr.-Ing.

Deputy Head of the Institute
