Conditional Infrared Image Synthesis for Semantic Segmentation

Christian Mayr

Duration: 6 months
Completion: April 2024
Supervisors: Dr. Michael Teutsch, Christian Kübler (both Hensoldt)
Examiner: Prof. Dr.-Ing. Norbert Haala

Introduction

Semantic segmentation, a fundamental task in computer vision, assigns a class label to each pixel of an image according to its semantic meaning. It is used primarily in automotive scenarios but also has applications in fields such as remote sensing and medical imaging. Modern deep learning methods are effective, but they rely heavily on large amounts of annotated training data, which poses a challenge for fully supervised learning because pixel-wise annotation is costly. As a result, few large public datasets for semantic segmentation are available. Recent efforts aim to reduce the annotation effort through approaches such as procedural generation of synthetic data. However, bridging the domain gap between synthetic and real data remains a challenge.

Moreover, while semantic segmentation has been studied extensively in the visual spectrum (VIS), it remains underexplored in the infrared (IR) domain. To address this gap, this thesis applies conditional image synthesis using the diffusion-based model ControlNet [8]. ControlNet [8] is transferred to the IR spectrum by training it on the Full-Time Multi-Modality Benchmark (FMB) [3] dataset, and the resulting model is used to generate a large number of synthetic IR images. These synthetic images are then used to train SegFormer [7], a Transformer-based semantic segmentation algorithm. The main contributions are: a method for synthesizing IR images based on ControlNet; the training of SegFormer on the synthesized data; and extensive experiments showing that the proposed approach achieves near state-of-the-art (sota) performance in IR semantic segmentation and generalizes better than models trained only on real data.

Methodology

Figure 1: The pipeline presented in this thesis. In the Synthesis Training block, the ControlNet [8] architecture is provided with a text prompt and GT IR images together with the corresponding semantic annotations from the FMB [3] dataset in order to train a model that can synthesize high-quality IR images. In the Data Generation block, the trained IR synthesis model is used to create a large, fully synthetic dataset by generating multiple IR images for each semantic map taken from the FMB dataset. In the final Segmentation Training block, the large synthetic IR dataset and the corresponding semantic annotations are used to train a semantic segmentation model based on the SegFormer [7] architecture. The same text prompt is used for training and for image synthesis.

To improve the performance of semantic segmentation approaches that operate on IR data, the pipeline shown in Figure 1 is proposed. The pipeline and the employed methodology are divided into three principal blocks.

The first block is the training of the IR image synthesis model. For this task, ControlNet [8], an advanced image synthesis architecture, is adapted to synthesize IR images conditioned on semantic maps. Training ControlNet with Ground Truth (GT) IR images and semantic maps makes it possible to synthesize high-quality IR images whose semantic composition matches the provided semantic map. The training data is taken from the FMB [3] dataset. This method allows multiple IR images to be created from a single semantic map. Although these images differ in appearance, their semantic composition stays consistent because the synthesis is controlled by the provided condition, the semantic map. ControlNet [8] exerts this control on its synthesis backbone, Stable Diffusion (SD) [4].
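ControlNet receives its condition as an image, so a semantic map has to be presented to it in image form. The summary above does not state how the semantic maps were encoded; the following minimal sketch assumes a simple color-coded encoding. The class ids, the colors, and the names PALETTE and colorize are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]

# Hypothetical class ids and colours. The actual FMB label set and the
# encoding used in the thesis are not specified in this summary.
PALETTE: Dict[int, RGB] = {
    0: (0, 0, 0),        # e.g. background
    1: (128, 64, 128),   # e.g. road
    2: (220, 20, 60),    # e.g. person
}


def colorize(label_map: List[List[int]], palette: Dict[int, RGB]) -> List[List[RGB]]:
    """Replace every class id by its palette colour, pixel by pixel."""
    return [[palette[class_id] for class_id in row] for row in label_map]


# A tiny 2x3 semantic map used as a stand-in for a real annotation.
label_map = [
    [1, 1, 2],
    [0, 1, 2],
]
condition = colorize(label_map, PALETTE)
```

The resulting condition has the same height and width as the label map, which is what allows the synthesized IR image to be paired pixel by pixel with the original annotation.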

The second block is the generation of a large synthetic dataset using the previously trained IR image synthesis model. Ten IR images are generated for each provided semantic map, so the resulting synthetic dataset is significantly larger than the FMB [3] dataset that supplies the input semantic maps.
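The data generation step can be sketched as a loop over semantic maps that requests ten images per map and pairs each result with the map it was conditioned on. This is an illustration only: varying the random seed to obtain different images is an assumption, and generation_plan, build_dataset and fake_synthesize are hypothetical names, with fake_synthesize standing in for the trained synthesis model.

```python
from typing import Callable, Iterator, List, Tuple

IMAGES_PER_MAP = 10  # stated in the text: ten IR images per semantic map


def generation_plan(map_ids: List[str],
                    per_map: int = IMAGES_PER_MAP) -> Iterator[Tuple[str, int]]:
    """Yield one (semantic map id, seed) pair for every image to synthesize."""
    for map_id in map_ids:
        for seed in range(per_map):
            yield map_id, seed


def build_dataset(map_ids: List[str],
                  synthesize: Callable[[str, int], str]) -> List[Tuple[str, str]]:
    """Pair every synthesized image with the semantic map it was conditioned on."""
    return [(synthesize(map_id, seed), map_id)
            for map_id, seed in generation_plan(map_ids)]


def fake_synthesize(map_id: str, seed: int) -> str:
    """Stand-in for the synthesis model: returns only an output file name."""
    return f"{map_id}_seed{seed}.png"


# Two hypothetical semantic map ids yield 2 * 10 = 20 image/annotation pairs.
dataset = build_dataset(["00001", "00002"], fake_synthesize)
```

Because each synthesized image reuses the annotation of its source map, the enlarged dataset needs no additional labeling.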

Finally, the semantic segmentation approach SegFormer [7] is trained on the synthesized dataset and evaluated. SegFormer [7] has been shown to be effective for IR image segmentation in works including [3, 1].

Experiments & Results

To evaluate the benefit of the proposed approach for semantic segmentation, multiple segmentation models were trained and compared. For this purpose, two other methods capable of generating IR images were selected: ASAPNet [5], a Generative Adversarial Network (GAN)-based image synthesis approach, and RGB2TIR [2], a GAN-based image translation approach. Like the proposed approach, both comparative methods were used to generate a fully synthetic dataset from data provided by the FMB [3] dataset.

These three synthetic datasets and the FMB [3] dataset itself were used to train four different SegFormer [7] models for IR image-based semantic segmentation.

The following experiments were conducted:

A qualitative assessment of the IR image synthesis quality
A quantitative as well as a qualitative evaluation of the trained segmentation models
A generalization study, in which the segmentation models trained on FMB or on the FMB-like synthesized datasets were applied to an unseen dataset, the Freiburg Thermal [6] dataset

Qualitative assessment - IR Synthesis

Figure 2: Example results of the IR image synthesis using the different approaches described above.

Segmentation Performance

Table 1: This table shows the mIoU and mAcc scores for the trained SegFormer [7] models. The first column describes what data was provided to SegFormer for each model training. The Oracle model is the model trained on real data in-domain - meaning it was trained on the FMB train set and evaluated on the FMB test set.

The table above shows the evaluation results of the previously described SegFormer [7] models. As expected, the model trained on the real FMB dataset, the so-called oracle model, performs best according to the mean Intersection over Union (mIoU) and mean Accuracy (mAcc) metrics. Importantly, the SegFormer [7] model trained on the data generated by the proposed pipeline performs best among the models trained on synthetic data.
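For reference, mIoU and mAcc are commonly computed from a confusion matrix accumulated over all evaluated pixels. The following minimal sketch uses this common definition; it is not the evaluation code of the thesis. The function names are hypothetical, classes that never occur are skipped when averaging (an assumption), and ignore labels are not handled.

```python
from typing import List, Tuple


def confusion_matrix(gt: List[int], pred: List[int], num_classes: int) -> List[List[int]]:
    """cm[g][p] counts the pixels with ground-truth class g predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for g, p in zip(gt, pred):
        cm[g][p] += 1
    return cm


def miou_macc(cm: List[List[int]]) -> Tuple[float, float]:
    """Return (mean IoU, mean per-class accuracy) for a square confusion matrix."""
    n = len(cm)
    ious, accs = [], []
    for c in range(n):
        tp = cm[c][c]                             # pixels correctly labelled c
        gt_c = sum(cm[c])                         # pixels whose ground truth is c
        pred_c = sum(cm[r][c] for r in range(n))  # pixels predicted as c
        union = gt_c + pred_c - tp
        if union > 0:      # skip classes absent from both GT and prediction
            ious.append(tp / union)
        if gt_c > 0:       # skip classes absent from the ground truth
            accs.append(tp / gt_c)
    return sum(ious) / len(ious), sum(accs) / len(accs)


# Flattened toy example with three classes and six pixels.
gt = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
miou, macc = miou_macc(confusion_matrix(gt, pred, num_classes=3))
# Per-class IoU is 1/3, 2/3, 1/2 and per-class accuracy is 1/2, 1, 1/2,
# so miou is 0.5 and macc is 2/3.
```

mIoU penalizes both missed and spurious pixels of a class, whereas mAcc only measures how much of each ground-truth class was recovered, which is why the two scores can differ noticeably for the same model.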

This observation is confirmed by the qualitative evaluation shown in the figure below. The segmentation maps of the GT, the oracle model, and the model trained on the data synthesized by the proposed pipeline look very similar. Together with the quantitative evaluation, this suggests that the SegFormer [7] model trained on the proposed data achieves near state-of-the-art performance.

Figure 3: Qualitative counterpart to the quantitative evaluation. The figure shows segmentation maps generated by the individual SegFormer [7] models. Each caption indicates the data on which the respective SegFormer [7] model was trained.

Generalization Study

Table 2: This table shows the results of the quantitative evaluation of the generalization study. The mIoU and mAcc metrics are used to evaluate the performance of models on new datasets.

The results of the generalization study in the table above show that the proposed pipeline clearly benefits the generalization capabilities of semantic segmentation approaches, demonstrated here using the SegFormer [7] architecture. The model trained on the data generated by the method presented in this thesis achieves the highest scores among all tested models. To illustrate these results, the figure below shows the outputs of the segmentation models applied to the test set of the Freiburg Thermal [6] dataset.

Figure 4: This figure displays the segmentation results of the generalization study performed on the test set of the Freiburg Thermal [6] dataset.

Conclusion & Outlook

In this work, a pipeline for synthesizing arbitrary amounts of high-quality IR images that correspond to existing semantic annotations has been presented. The proposed approach can help to mitigate the shortage of publicly available datasets in the IR domain. Therefore, this work can contribute to the research and development of state-of-the-art image processing methods, especially semantic segmentation, based on IR images. Furthermore, the effectiveness and benefits of the pipeline for IR image-based semantic segmentation have been demonstrated by extensive experiments.

Future research could evaluate the influence of using 'foreign' semantic maps as input for the image synthesis. A semantic map is called foreign if it originates from a dataset on which the synthesis model has not been trained. For example, a semantic map from the Freiburg Thermal [6] dataset would be 'foreign' if it were used as input to synthesize images with a ControlNet [8] model trained on the FMB [9] dataset. Other possibilities include more detailed hyperparameter tuning for ControlNet [8] and for the SegFormer [7] models, as well as experiments towards closing the synthetic-to-real domain gap. Such experiments could use small amounts of real data to fine-tune a model that has been pre-trained on synthesized data.

Bibliography

[1] Zülfiye Kütük and Görkem Algan. “Semantic Segmentation for Thermal Images: A Comparative Survey”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. June 2022, pp. 286–295. url: https://openaccess.thecvf.com/content/CVPR2022W/PBVS/html/Kutuk_Semantic_Segmentation_for_Thermal_Images_A_Comparative_Survey_CVPRW_2022_paper.html

[2] Dong-Guw Lee et al. “Edge-guided multi-domain rgb-to-tir image translation for training vision tasks with challenging labels”. In: 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE. 2023, pp. 8291–8298. url: https://ieeexplore.ieee.org/abstract/document/10161210/, https://github.com/RPM-Robotics-Lab/sRGB-TIR

[3] Jinyuan Liu et al. “Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation”. In: International Conference on Computer Vision. 2023. url: https://arxiv.org/abs/2308.02097

[4] Robin Rombach et al. “High-Resolution Image Synthesis with Latent Diffusion Models”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2022. url: https://arxiv.org/abs/2112.10752, https://github.com/CompVis/latent-diffusion

[5] Tamar Rott Shaham et al. “Spatially-adaptive pixelwise networks for fast image translation”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 14882–14891. url: https://tamarott.github.io/ASAPNet_web/

[6] Johan Vertens, Jannik Zürn, and Wolfram Burgard. “HeatNet: Bridging the Day-Night Domain Gap in Semantic Segmentation with Thermal Images”. In: arXiv preprint arXiv:2003.04645 (2020). url: http://thermal.cs.uni-freiburg.de/

[7] Enze Xie et al. “SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers”. In: Neural Information Processing Systems (NeurIPS). 2021. url: https://github.com/NVlabs/SegFormer

[8] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. “Adding Conditional Control to Text-to-Image Diffusion Models”. Feb. 2023. url: https://github.com/lllyasviel/ControlNet

[9] Jinyuan Liu et al. “Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation”. In: International Conference on Computer Vision. 2023. url: https://arxiv.org/abs/2308.02097.
