Sourena Abdolzadeh
Fine-Tuning and Evaluation of the Segment Anything Model for Tree Crown Segmentation Using Multi-Annotator Data
Duration: 6 months
Completion: May 2026
Supervisor: Dr.-Ing. David Collmar
Examiner: Prof. Dr.-Ing. Uwe Sörgel
Introduction
Tree crown segmentation is an important task in remote sensing and geospatial analysis. Accurate information about individual tree crowns can support urban green monitoring, forest analysis, environmental management, and vegetation mapping. However, tree crown segmentation in UAV-derived orthoimagery is challenging. Tree crowns often have irregular shapes, overlapping boundaries, different sizes, and similar textures to surrounding vegetation. In dense areas, several neighbouring crowns may appear as one connected canopy, while small or shadowed trees can be difficult to separate.
The Segment Anything Model (SAM) is a powerful foundation model for image segmentation. It can generate object masks from simple prompts such as points or bounding boxes and has shown strong general segmentation ability. However, SAM was not specifically designed for UAV-derived orthoimagery. In this type of imagery, tree crowns can be visually ambiguous, and the expected object boundaries may differ from the masks produced by the zero-shot model. Therefore, task-specific adaptation through fine-tuning can still be useful.
A second important challenge is the definition of ground truth. Tree crown boundaries are not always objectively clear, and different annotators may mark the same tree in different ways. Some annotators may draw wider crowns, others may be more conservative, and some small or unclear trees may be missed completely. For this reason, the thesis does not rely on a single annotation. Instead, it investigates how multiple human annotations can be integrated and how this affects the fine-tuning and evaluation of SAM.
The main goal of this thesis is to evaluate whether decoder-only fine-tuning can improve SAM for tree crown segmentation using multi-annotator data. The work also analyses how different annotation representations, such as consensus, core, and manually refined hard ground truth, influence training and evaluation.
Methodology
The dataset consists of 20 high-resolution UAV-derived orthoimage tiles. These tiles were created by dividing one larger orthophoto into smaller image sections. Each tile has a size of approximately 2353 × 3200 pixels and contains multiple tree crowns, grass areas, shadows, and other vegetation structures. Since all image tiles originate from the same orthophoto, the dataset provides a controlled setting for studying annotation variability and model fine-tuning. At the same time, this limits how far the results generalize to other locations or image conditions.
The annotations were created by 20 human annotators. Each annotator labelled five image tiles, and the assignment was organized so that each image was annotated by five different people. The annotations were stored as binary masks, where foreground pixels represent tree crown regions and background pixels represent non-tree areas.
Because each binary mask can contain several tree crown regions, the first processing step was instance extraction. Connected-component analysis with 8-connectivity was used to separate individual tree crown candidates from each binary mask. Very small components were removed using a minimum area threshold to reduce noise and annotation artifacts.
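The instance extraction step can be illustrated with a minimal sketch. It assumes a binary mask held as a NumPy array and uses a plain breadth-first flood fill so that the 8-connectivity rule is explicit. The function name `extract_instances` and the default `min_area` value are hypothetical, since the threshold used in the thesis is not stated here. In practice a library routine such as `scipy.ndimage.label` with a 3 × 3 structuring element of ones would do the same job faster.

```python
from collections import deque

import numpy as np

# 8-connectivity: horizontal, vertical and diagonal neighbours all count.
NEIGHBOURS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                (0, 1), (1, -1), (1, 0), (1, 1)]


def extract_instances(mask, min_area=50):
    """Split a binary mask into 8-connected components.

    Components with fewer than `min_area` pixels are dropped.
    `min_area=50` is an illustrative value, not the thesis setting.
    Returns a list of boolean masks, one per kept component.
    """
    mask = np.asarray(mask).astype(bool)
    height, width = mask.shape
    visited = np.zeros_like(mask, dtype=bool)
    instances = []

    for r0, c0 in zip(*np.nonzero(mask)):
        if visited[r0, c0]:
            continue
        # Flood fill starting from an unvisited foreground pixel.
        queue = deque([(r0, c0)])
        visited[r0, c0] = True
        pixels = []
        while queue:
            r, c = queue.popleft()
            pixels.append((r, c))
            for dr, dc in NEIGHBOURS_8:
                rr, cc = r + dr, c + dc
                inside = 0 <= rr < height and 0 <= cc < width
                if inside and mask[rr, cc] and not visited[rr, cc]:
                    visited[rr, cc] = True
                    queue.append((rr, cc))

        if len(pixels) >= min_area:
            rows, cols = zip(*pixels)
            instance = np.zeros_like(mask, dtype=bool)
            instance[list(rows), list(cols)] = True
            instances.append(instance)

    return instances
```

With 8-connectivity, two crown regions that touch only at a corner pixel are treated as one candidate, so the choice of connectivity directly affects how many instances a dense canopy mask yields.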
After instance extraction, tree crown instances from different annotators were matched to determine which masks refer to the same physical tree. The main matching criterion was Intersection over Union (IoU). Since the same tree can be annotated with slightly shifted or differently shaped masks, a fallback rule based on centroid distance and area similarity was also used. Matched instances were then grouped into clusters, where each cluster represents one candidate tree crown supported by one or more annotators.
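The matching logic described above might look roughly like the following sketch. All thresholds (`iou_thr`, `max_dist`, `min_area_ratio`) are placeholder values chosen for illustration, and the transitive grouping via union-find is an assumption, because the text does not specify how clusters are formed from pairwise matches. Masks are assumed to be non-empty boolean arrays of equal shape.

```python
import numpy as np


def iou(a, b):
    """Intersection over Union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def centroid(mask):
    """Mean (row, col) position of the foreground pixels."""
    rows, cols = np.nonzero(mask)
    return np.array([rows.mean(), cols.mean()])


def same_tree(a, b, iou_thr=0.5, max_dist=20.0, min_area_ratio=0.5):
    """Primary criterion: IoU. Fallback: close centroids and similar area.

    The three thresholds are hypothetical example values.
    """
    if iou(a, b) >= iou_thr:
        return True
    dist = np.linalg.norm(centroid(a) - centroid(b))
    area_a, area_b = a.sum(), b.sum()
    area_ratio = min(area_a, area_b) / max(area_a, area_b)
    return bool(dist <= max_dist and area_ratio >= min_area_ratio)


def cluster_instances(instances, **thresholds):
    """Group instances that refer to the same tree.

    `instances` is a list of (annotator_id, mask) pairs. Returns a list
    of clusters, each a list of indices into `instances`. Grouping is
    transitive (union-find), which is an assumption of this sketch.
    """
    parent = list(range(len(instances)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(instances)):
        for j in range(i + 1, len(instances)):
            if instances[i][0] == instances[j][0]:
                continue  # never match two masks from the same annotator
            if same_tree(instances[i][1], instances[j][1], **thresholds):
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(instances)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

A cluster of size one corresponds to a tree marked by a single annotator, which is exactly the case where later agreement-based representations may discard a valid crown.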
Based on these matched clusters, different annotation representations were created. The consensus representation was generated by majority voting, represents regions supported by several annotators, and was used as the main training supervision. The core representation contains only stricter high-agreement regions and was mainly used to analyse annotation uncertainty. The hard ground truth was manually refined and used as the final reference for validation and testing. This separation is important because consensus and core masks describe annotator agreement, whereas the hard ground truth defines what the models are evaluated against.
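As a rough illustration of the agreement-based representations, the sketch below derives a consensus and a core mask from per-pixel vote counts. It assumes that the masks of all annotators of one tile are aligned arrays of equal shape, that "majority" means more than half of the annotators, and that "core" means at least `core_min_votes` annotators agree. The function name and the value `core_min_votes=4` are hypothetical, because the exact core criterion is not given in the text.

```python
import numpy as np


def agreement_masks(masks, core_min_votes=4):
    """Derive consensus and core masks from several annotator masks.

    consensus: pixels marked by a strict majority of annotators.
    core:      pixels marked by at least `core_min_votes` annotators
               (illustrative threshold, stricter than the majority).
    Also returns the per-pixel vote count for uncertainty analysis.
    """
    stack = np.stack([np.asarray(m).astype(bool) for m in masks])
    votes = stack.sum(axis=0)
    n_annotators = stack.shape[0]

    consensus = votes > n_annotators / 2
    core = votes >= core_min_votes
    return consensus, core, votes
```

With five annotators per tile, a pixel needs at least three votes to enter the consensus mask under this assumption, so a tree marked by only one or two people is absent from the training supervision.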
For fine-tuning, SAM was adapted using a decoder-only strategy. The image encoder and prompt encoder were kept frozen, while only the mask decoder was trained. This keeps the general pretrained image representation of SAM and reduces the risk of overfitting on the small dataset. For each training instance, a positive foreground point and a bounding box were automatically generated from the instance mask and used as prompts for SAM.
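One possible way to derive the two prompts from an instance mask is sketched below. The text does not say how the positive point is chosen, so this example assumes the foreground pixel closest to the mask centroid, which guarantees that the point lies inside the crown even for concave shapes. The function name `prompts_from_mask` is hypothetical. Coordinates are returned as (x, y) points and (x_min, y_min, x_max, y_max) boxes, the layout commonly used for SAM prompts.

```python
import numpy as np


def prompts_from_mask(mask):
    """Build a positive point prompt and a box prompt from one instance mask.

    Returns:
        point: array of shape (1, 2) holding (x, y) of a foreground pixel
        label: array of shape (1,) with value 1 (positive point)
        box:   array (x_min, y_min, x_max, y_max), inclusive pixel bounds
    """
    rows, cols = np.nonzero(np.asarray(mask).astype(bool))
    if rows.size == 0:
        raise ValueError("mask has no foreground pixels")

    # Tight bounding box around all foreground pixels.
    box = np.array([cols.min(), rows.min(), cols.max(), rows.max()])

    # Assumption: use the foreground pixel nearest to the centroid, since
    # the centroid itself can fall on background for ring or L shapes.
    dist_sq = (rows - rows.mean()) ** 2 + (cols - cols.mean()) ** 2
    k = int(np.argmin(dist_sq))
    point = np.array([[cols[k], rows[k]]])
    label = np.array([1])
    return point, label, box
```

Because both prompts come from the training mask itself, the decoder is trained under near-ideal prompting. At inference time, prompts from another source would be noisier than this.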
Experiments and Results
The experiments compare the original pretrained SAM model with the fine-tuned SAM model. Two SAM variants were evaluated: ViT-B and ViT-H. The experiments used image-level train-validation-test splits to avoid data leakage. This means that all samples derived from the same original image tile were kept in the same split.
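The idea of an image-level split can be sketched as follows: tiles are assigned to a split first, and every sample inherits the split of its tile. The sample layout (a dict with a `tile_id` key), the function name, and the seeded shuffle are assumptions made for this example and are not taken from the thesis code.

```python
import random


def split_by_tile(samples, n_train, n_val, n_test, seed=0):
    """Assign whole tiles to train/val/test so no tile spans two splits.

    `samples` is a list of dicts that each carry a "tile_id" key
    (hypothetical layout). The split sizes count tiles, not samples.
    """
    tiles = sorted({s["tile_id"] for s in samples})
    if n_train + n_val + n_test != len(tiles):
        raise ValueError("split sizes must add up to the number of tiles")

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(tiles)

    groups = {
        "train": set(tiles[:n_train]),
        "val": set(tiles[n_train:n_train + n_val]),
        "test": set(tiles[n_train + n_val:]),
    }
    return {
        name: [s for s in samples if s["tile_id"] in tile_ids]
        for name, tile_ids in groups.items()
    }
```

Splitting at the sample level instead would let crowns from the same tile appear in both training and test data, which would inflate the measured improvement.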
The evaluation was performed using three metrics. IoU and Dice were used to measure region overlap between the prediction and the hard ground truth. HD95 was used as a boundary-based metric to evaluate how closely the predicted boundaries follow the reference boundaries. This combination is important because a model can improve in area overlap while still producing inaccurate boundaries.
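The three metrics can be written down compactly. The sketch below assumes boolean masks of equal shape and takes HD95 as the larger of the two directed 95th-percentile boundary distances, measured in pixels. Several HD95 variants exist, and the exact definition used in the thesis is not stated, so this is one common choice and not necessarily the author's. The brute-force distance matrix is only suitable for small masks. For full tiles a distance transform would be the usual approach.

```python
import numpy as np


def iou_score(pred, ref):
    """Intersection over Union between prediction and reference."""
    pred, ref = np.asarray(pred).astype(bool), np.asarray(ref).astype(bool)
    union = np.logical_or(pred, ref).sum()
    return float(np.logical_and(pred, ref).sum() / union) if union else 1.0


def dice_score(pred, ref):
    """Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    pred, ref = np.asarray(pred).astype(bool), np.asarray(ref).astype(bool)
    total = pred.sum() + ref.sum()
    return float(2 * np.logical_and(pred, ref).sum() / total) if total else 1.0


def boundary_points(mask):
    """(row, col) of foreground pixels with a background 4-neighbour."""
    mask = np.asarray(mask).astype(bool)
    padded = np.pad(mask, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    return np.argwhere(mask & ~interior)


def hd95(pred, ref):
    """95th-percentile Hausdorff distance between mask boundaries."""
    p = boundary_points(pred).astype(float)
    r = boundary_points(ref).astype(float)
    if len(p) == 0 or len(r) == 0:
        return float("inf")  # undefined when one mask is empty
    # Pairwise Euclidean distances between all boundary pixels.
    dists = np.linalg.norm(p[:, None, :] - r[None, :, :], axis=2)
    pred_to_ref = dists.min(axis=1)
    ref_to_pred = dists.min(axis=0)
    return float(max(np.percentile(pred_to_ref, 95),
                     np.percentile(ref_to_pred, 95)))
```

For example, shifting a 10 × 10 square by two pixels gives an IoU of 2/3 and a Dice of 0.8, while HD95 reports the two-pixel boundary offset directly. This is why the overlap and boundary metrics are read together.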
The results show that decoder-only fine-tuning leads to consistent but moderate improvements over baseline SAM. Both ViT-B and ViT-H benefited from fine-tuning, but the improvement was clearer for ViT-H. Under the 16-2-2 split (16 training, 2 validation, and 2 test tiles), IoU improved by 2.95% for ViT-H and by 0.90% for ViT-B. Under the 12-4-4 split, IoU improved by 2.68% for ViT-H and by 1.69% for ViT-B. The HD95 values also improved after fine-tuning, showing better boundary agreement with the hard ground truth. The strongest and most stable results were achieved by ViT-H.
The visual results support the quantitative findings. Fine-tuning mainly improves local mask quality and boundary alignment. In many cases, the fine-tuned model better follows the target tree crown regions and corrects parts of the baseline SAM output. However, the improvement should be understood as a refinement, not as a complete transformation of the model. Difficult areas remain challenging, especially where tree crowns overlap, boundaries are unclear, or the training masks contain uncertainty.
The results also show that annotation quality and ground truth definition strongly influence the interpretation of model performance. Consensus masks are useful as training supervision, but they can miss valid trees if those trees were not annotated by enough people. Core masks show high-confidence regions but can be too conservative. Therefore, using a manually refined hard ground truth for final evaluation provides a more reliable reference for comparing baseline and fine-tuned SAM.
Conclusion
This thesis shows that decoder-only fine-tuning can improve SAM for tree crown segmentation in UAV-derived orthoimagery, especially when using the larger ViT-H model. The improvements are moderate but consistent across the tested settings. Fine-tuning mainly acts as a task-specific refinement that helps SAM better adapt to the appearance of tree crowns and the selected annotation style.
At the same time, the results are limited by the small dataset size, the fact that all image tiles come from one orthophoto, and the uncertainty in human annotations. The work therefore highlights that model adaptation and ground truth preparation should be considered together. For future work, a larger and more diverse dataset, improved instance separation in dense canopy areas, and more advanced uncertainty-aware training strategies could further improve SAM-based tree crown segmentation.
Contact
Uwe Sörgel
Prof. Dr.-Ing., Head of Institute, Academic Advisor
David Collmar
Dr.-Ing., Group Leader Geoinformatics