Fine-tuning SAM using Crowdsourcing for the Acquisition of Tree Outlines

Master Thesis at ifp - Dingze Li

Duration: 6 months
Completion: August 2024
Supervisor: M.Sc. David Colmar
Examiner: Dr.-Ing. Volker Walter

Introduction

Segmentation in remote sensing image analysis is vital for understanding object location, category, and shape, and it significantly advances Earth observation applications such as disaster monitoring and urban planning. The Segment Anything Model (SAM [1]), a prominent AI model pre-trained on an extensive dataset, excels at accurate object segmentation. Despite SAM's impressive performance, fine-tuning it for specific scenes remains challenging because features vary between environments. To address this, recent approaches have adopted parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA [2]), which adjusts SAM's parameters to accommodate new scenarios while reducing the amount of data required.
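LoRA keeps the pre-trained weights frozen and learns only a low-rank correction that is added to selected weight matrices (in SAM, typically the attention projections of the image encoder). The following is a minimal PyTorch sketch of the idea for a single linear layer; the class name and hyperparameters are illustrative, not the exact configuration used in this thesis.

```python
# Minimal LoRA sketch: a frozen nn.Linear augmented with a trainable
# low-rank update W*x + (alpha/r) * B(A(x)). Illustrative only.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # the pre-trained weights stay frozen
            p.requires_grad = False
        # A starts with small random values, B with zeros, so the low-rank
        # path is initially a no-op and only the new parameters are trained.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```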

Crowdsourcing provides an effective solution for acquiring the volumes of annotated data necessary for fine-tuning. Repeated labeling, in which multiple non-expert workers annotate the same instances, enhances the accuracy and reliability of the data. This thesis explores the feasibility of fine-tuning SAM using crowdsourced data, examining its potential and limitations while assessing how dataset quality impacts model performance.
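With repeated labeling, the individual annotations for an image have to be combined into a single label. A common fusion rule for masks is pixel-wise majority voting; whether the thesis uses exactly this rule is an assumption, so the following is a hypothetical sketch.

```python
# Hypothetical fusion of repeated crowd masks by pixel-wise majority vote.
import numpy as np

def majority_vote(masks: np.ndarray) -> np.ndarray:
    """masks: K x H x W stack of binary masks from K workers for one image."""
    votes = masks.astype(np.float32).mean(axis=0)
    return votes > 0.5  # keep a pixel if more than half the workers marked it
```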

Methodology

In this research, we conducted an aerial survey using a drone to capture images of trees in the vicinity of Stuttgart. From the collected data, we manually selected a subset of 50 images, each containing a tree, and divided this subset into training and validation sets.

Next, we crowdsourced the training data needed for fine-tuning. To collect this data, we built user-friendly web tools as the front-end interface, while a local SAM instance ran in the back-end and guided the user through the task by exchanging data with the front-end. We designed three web tools in total to test the impact of different input types on fine-tuning SAM: a positive-points tool, a positive-and-negative-points tool, and a question-based interactive tool (abbreviated as the P tool, PN tool, and Question tool, with the corresponding fine-tuned models called the P model, PN model, and Question model). The workflow of the three web tools is shown in Figure 1.
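In the back-end, the clicks collected by a web tool can be forwarded to SAM as point prompts. Below is a hedged sketch using the public segment_anything API; the checkpoint path, function name, and prompt values are placeholders, not the exact back-end code of this thesis.

```python
# Sketch of the back-end step that turns crowdsourced clicks into a SAM mask.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
predictor = SamPredictor(sam)

def segment_tree(image: np.ndarray, points: np.ndarray, labels: np.ndarray):
    """image: H x W x 3 RGB; points: N x 2 (x, y) clicks;
    labels: 1 = positive click, 0 = negative click.
    The P tool would send only labels of 1; the PN tool sends both values."""
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        point_coords=points,
        point_labels=labels,
        multimask_output=False,  # one mask per prompt set
    )
    return masks[0], scores[0]
```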

Finally, we fed the different data sets into LoRA training to obtain several fine-tuned models, and used our validation set to evaluate and compare them.

Figure 1. The workflow of three different web tools.

Experiment and Results

We fed the three data sets into LoRA training to obtain three fine-tuned models. We then evaluated each model on the validation set, both with and without input prompts, computed the IoU (Intersection over Union) against the references, and plotted the corresponding histograms. All histograms are shown in Figure 2.
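The IoU of a predicted mask against a reference mask is the area of their overlap divided by the area of their union. A minimal sketch (treating an empty union as a perfect score is a convention chosen here):

```python
# IoU between a predicted binary mask and the reference mask.
import numpy as np

def iou(pred: np.ndarray, ref: np.ndarray) -> float:
    """pred, ref: H x W boolean masks."""
    inter = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    return float(inter) / float(union) if union > 0 else 1.0
```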

We find that fine-tuning substantially enhances model performance when inputs are provided, with the PN model achieving the highest accuracy and precision. Without inputs, however, performance declines significantly, revealing a strong reliance on inputs for a high IoU. SAM exhibits high precision but low accuracy with inputs, and low precision but high accuracy without inputs. In contrast, the Question model and the P model show the opposite trends in precision and accuracy. Among all models, the PN model consistently performs best, maintaining the highest precision and accuracy both with and without inputs and outperforming the other fine-tuned models in all scenarios.

Figure 2. Histograms of the IoU between the references and the predictions of SAM and the three fine-tuned models, with and without inputs.

To assess whether the quality of the training sets affects the fine-tuned models, we removed the lower-quality data from the original three training sets and re-trained with LoRA to obtain three "high-quality" models. As before, we evaluated these models on the validation set, computing the IoU with and without inputs, and plotted the corresponding histograms. All histograms are shown in Figure 3.
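The quality criterion is not detailed here; one plausible scheme under repeated labeling is to score each crowd mask by its mean IoU agreement with the other masks collected for the same image and to drop masks below a threshold. The sketch below reuses the iou helper from above and is purely hypothetical.

```python
# Hypothetical quality filter based on inter-annotator IoU agreement.
import numpy as np

def filter_low_quality(masks, threshold=0.5):
    """masks: list of H x W boolean masks from different workers for one image."""
    kept = []
    for i, m in enumerate(masks):
        agreement = [iou(m, o) for j, o in enumerate(masks) if j != i]
        if agreement and float(np.mean(agreement)) >= threshold:
            kept.append(m)
    return kept
```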

We find that the models generally benefit from higher-quality training data sets, with varying impacts depending on the model type. The PN model consistently improves in both accuracy and precision, regardless of whether inputs are provided. The Question model also shows significant enhancements in both accuracy and precision, with notable improvements when inputs are removed. In contrast, the P model exhibits mixed results: precision improves with inputs, but accuracy slightly drops, and both metrics decline when inputs are absent.

Overall, higher-quality training datasets and the presence of inputs positively influence model performance, though the effects differ by model. The PN and Question models show clear and consistent improvements, while the P model presents a more complex pattern with less predictable outcomes.

Figure 3. Histograms of the IoU between the references and the predictions of SAM, the three fine-tuned models, and the three high-quality fine-tuned models, with and without inputs (blue: original models; pink: high-quality models).

For a specific image, Figure 4 shows the masks generated by SAM and by the six fine-tuned models. The comparison reveals significant improvements for the PN model and the Question model: erroneous masks adjacent to the ground truth have been eliminated, and the PN model comes very close to the ground truth. For the P model, however, the erroneous mask adjacent to the ground truth has not been removed, and a large new erroneous mask has even appeared on the left edge of the image. This observation is consistent with our previous analysis.

Figure 4. Original image, ground truth, and the results of SAM and our models. (a) Results of the original models trained on the complete dataset. (b) Results of the high-quality models trained on the filtered dataset.

Conclusions

Our study highlights the substantial impact of dataset quality and input methodology on fine-tuning SAM with crowdsourced data. Fine-tuning significantly enhances model performance when inputs are provided, with the PN model achieving the highest accuracy and precision. However, performance drops without inputs, showing a strong dependence on inputs for a high IoU. SAM exhibits high precision but low accuracy with inputs, and low precision but high accuracy without inputs, while the Question and P models show varying patterns in these metrics.

When refining datasets by removing lower-quality data, the PN model consistently improved in both accuracy and precision across scenarios, whereas the P model showed mixed results and the Question model demonstrated notable enhancements. Overall, high-quality datasets markedly improve model performance and underscore the importance of rigorous data validation. These findings suggest that balanced input approaches and refined crowdsourcing strategies are crucial for optimizing future model fine-tuning.

References (Selection)

[1] Kirillov A, Mintun E, Ravi N, et al. Segment Anything. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023: 4015-4026.

[2] Hu E J, Shen Y, Wallis P, et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685, 2021.

Contact

Volker Walter

Dr.-Ing.

Group Leader Geoinformatics
