Crowdsourcing for Acquiring High-Quality Training Data for Machine Learning Methods

Machine Learning Algorithms Learning from the Crowd

Machine learning techniques such as Convolutional Neural Networks have become the state of the art for the interpretation and automatic annotation of many kinds of data. However, these methods typically require huge amounts of labeled data (“Ground Truth”). Especially in the domain of geospatial data analysis, such datasets are scarce. Commonly, labels are collected by experts, but since the labeling process is very time-consuming, this approach causes tremendous costs and might not be feasible.

The focus of this research therefore lies in outsourcing this tedious and costly task to the crowd (“Crowdsourcing”). The crowd is composed of people from various countries, with different cultural backgrounds and different abilities, so a very inhomogeneous label quality is to be expected. Methods of quality control are hence crucial when using this data as training data for machine learning algorithms.

As already demonstrated in various studies, one remedy is the “Wisdom of the Crowd”: a large pool of humans is capable of solving tasks that a single individual cannot, by combining the strengths of all workers in the crowd. The simplest realization of this technique is the Majority Vote.
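As a minimal sketch of this idea (not the project's actual pipeline), the Majority Vote over redundant crowd labels can be implemented in a few lines; the class names here are purely illustrative:

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among the crowd answers.

    Ties are broken by first occurrence, as returned by Counter.most_common.
    """
    return Counter(labels).most_common(1)[0][0]

# Example: five crowdworkers assign a class to the same point cloud point
votes = ["tree", "building", "tree", "tree", "ground"]
print(majority_vote(votes))  # -> tree
```

With enough redundant acquisitions per sample, individual labeling errors are outvoted, which is exactly the effect the “Wisdom of the Crowd” exploits.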

Although this approach can lead to high-quality results, it requires multiple acquisitions of the same label by different crowdworkers. One goal of this research is to determine the number of repeated acquisitions required; at the same time, we aim to minimize the total labeling effort. For this purpose, we combine the strengths of machine learning techniques and human perception. In this context, Active Learning strategies can be applied, where the machine queries the most informative instances, so that human labeling effort can be deployed purposefully by focusing on these samples.
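A common way to select the “most informative” instances is uncertainty sampling, where samples with the highest predictive entropy of the current model are sent to the crowd. The following sketch illustrates this query step under the assumption that class probabilities are already available; it is not the project's actual implementation:

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete class-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def query_most_informative(unlabeled_probs, k=2):
    """Return indices of the k samples with the highest predictive entropy.

    unlabeled_probs: list of per-sample class-probability lists predicted
    by the current model; high entropy means the model is most uncertain.
    """
    ranked = sorted(range(len(unlabeled_probs)),
                    key=lambda i: entropy(unlabeled_probs[i]),
                    reverse=True)
    return ranked[:k]

# Class probabilities for three unlabeled samples (hypothetical values)
probs = [[0.90, 0.05, 0.05],   # model is confident
         [0.40, 0.35, 0.25],   # model is uncertain
         [0.34, 0.33, 0.33]]   # model is most uncertain
print(query_most_informative(probs, k=2))  # -> [2, 1]
```

Only the returned samples are then forwarded to the crowd for labeling, so the redundant acquisitions needed for the Majority Vote are spent where they help the model most.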

In order to reach a vast pool of potential workers and to acquire labels, web-based tools are employed for presenting the data and enabling the crowd to contribute (see Figures 1 and 2). While our goal is to collect crowd labels for multiple data representations, such as imagery, 3D point clouds and 3D meshes, we are also investigating which representation is best suited for label acquisition by the crowd.

Figure 1: Implemented web tool for labeling points in 3D point clouds.

© Michael Kölle

Figure 2: Implemented web tool for annotation of trees in 3D point clouds.

© Michael Kölle

ifp publications

Walter, V. & Soergel, U. [2018]
Results and Problems of Paid Crowd-Based Geospatial Data Collection. PFG – Journal of Photogrammetry, Remote Sensing and Geoinformation Science, pp. 1-11.
DOI: 10.1007/s41064-018-0058-z

Walter, V.; Laupheimer, D. & Fritsch, D. [2016]
Use and Optimization of Paid Crowdsourcing for the Collection of Geodata. Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., XLI-B4, pp. 253-257.
DOI: 10.5194/isprs-archives-XLI-B4-253-2016

Volker Walter

Head of Research Group Geoinformatics

Michael Kölle

Research Associate