Deep Learning for the Classification of Building Facades

Dominik Laupheimer

Duration of the Thesis: 6 months
Completion: August 2017
Supervisor: Dipl.-Ing. Patrick Tutzauer
Supervisor & Examiner: Prof. Dr.-Ing. Norbert Haala

In the last few years, Artificial Neural Networks (ANNs) have become the state of the art in research areas such as image recognition, object recognition and speech recognition. Convolutional Neural Networks (CNNs) in particular have proved very useful for visual tasks. We established a data-driven, end-to-end approach that uses CNNs to classify images of building facades into five classes. Figure 1 shows the architecture of a CNN; table 1 lists the considered classes.

Figure 1: Architecture of a CNN for multiclass classification with an input layer, multiple hidden layers and an output layer. The number of classes n_classes defines the size of the output layer. The depth of each hidden layer depends on the number of kernels applied to the previous layer.
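The relationship between kernels and layer depth described in the caption can be illustrated with a small shape calculation. This is only a sketch: the input size, kernel counts, kernel size, padding and pooling are hypothetical values chosen for illustration and are not the architectures trained in the thesis. Only the number of classes (five) is taken from the text.

```python
# Sketch: how layer shapes evolve in a CNN. All layer settings below are
# hypothetical; only N_CLASSES = 5 comes from the text.

def conv_output_shape(in_shape, n_kernels, kernel_size=3, stride=1, padding=0):
    """Return (height, width, depth) after a convolutional layer.

    The output depth equals the number of kernels applied to the input,
    regardless of the input depth.
    """
    height, width, _ = in_shape
    out_h = (height + 2 * padding - kernel_size) // stride + 1
    out_w = (width + 2 * padding - kernel_size) // stride + 1
    return (out_h, out_w, n_kernels)


def pool_output_shape(in_shape, pool_size=2):
    """Return the shape after non-overlapping pooling; depth is unchanged."""
    height, width, depth = in_shape
    return (height // pool_size, width // pool_size, depth)


N_CLASSES = 5            # number of facade classes stated in the text
shape = (64, 64, 3)      # hypothetical RGB input image
shapes = [shape]

for n_kernels in (16, 32):   # hypothetical kernel counts per hidden block
    shape = conv_output_shape(shape, n_kernels, kernel_size=3, padding=1)
    shape = pool_output_shape(shape)
    shapes.append(shape)

output_layer_size = N_CLASSES  # one output neuron per class

for s in shapes:
    print(s)
print("output layer size:", output_layer_size)
```

In this toy configuration, each block halves the spatial resolution through pooling while the depth follows the chosen number of kernels.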

Table 1: Considered classes for the classification of images of building facades.

We trained several state-of-the-art networks as well as our own networks on data sets of different sizes, and compared the performance of networks trained from scratch with that of pretrained networks. The best-performing network (pretrained InceptionV3) achieved an overall accuracy of approximately 64%. This moderate accuracy reflects the complexity of the problem (high intra-class variation, occlusions, different scales, different illumination, etc.). Figure 2 shows the normalized confusion matrix of the best-performing network and the corresponding classification report.

Figure 2: Normalized confusion matrix (left) and classification report (right) of the InceptionV3 network pretrained on ImageNet.
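The quantities shown in figure 2 can be derived from true and predicted labels as sketched below. The label lists are invented toy data, not results from the thesis, and the helper functions are illustrative names; the sketch assumes that "normalized" means each row (true class) is divided by its sum, which is a common convention but is not stated in the text.

```python
# Sketch: row-normalized confusion matrix, overall accuracy, and per-class
# precision/recall. The labels below are made-up toy data.

CLASSES = ["commercial", "hybrid", "residential", "specialUse", "underConstruction"]


def confusion_matrix(y_true, y_pred, classes):
    """Rows correspond to true classes, columns to predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its sum; rows without samples stay all zero."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


def overall_accuracy(matrix):
    """Fraction of all samples on the main diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


def precision_recall(matrix, i):
    """Precision and recall of class i, as listed in a classification report."""
    tp = matrix[i][i]
    predicted = sum(row[i] for row in matrix)   # column sum
    actual = sum(matrix[i])                     # row sum
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Toy labels (hypothetical, for illustration only)
y_true = ["residential", "residential", "residential", "commercial", "hybrid", "specialUse"]
y_pred = ["residential", "residential", "hybrid", "commercial", "hybrid", "residential"]

cm = confusion_matrix(y_true, y_pred, CLASSES)
cm_norm = normalize_rows(cm)
accuracy = overall_accuracy(cm)
res_precision, res_recall = precision_recall(cm, CLASSES.index("residential"))
```

With row normalization, the diagonal entries equal the per-class recall, which is why a class with poor separability shows up as a weak diagonal entry.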

As expected, the classes specialUse and underConstruction perform worse than the classes commercial, hybrid and residential because of their definitions (high intra-class variance and poor separability, see table 1). Furthermore, residential performs best, which is desirable as the majority of real-world buildings belong to this class. Figure 3a shows the Class Activation Map (CAM) of a correct prediction. A CAM visualizes which image areas are important for the network's decision. By evaluating CAMs, we could locate the sources of error in the image classification task: some misclassifications are caused by undetected features, others by misinterpreted features (see figure 3b).


Figure 3a: Correct prediction.	Figure 3b: False prediction.	Figure 3c: "False" prediction.
Figure 3: CAMs of correct and false predictions. The small bell tower in figure 3b is mistaken for a chimney, causing a false prediction. The "false" prediction in figure 3c results from the discrepancy between human labeling and automatic labeling: the transformer house occupying most of the image is decisive for the human-given label, whereas the CNN focuses on the residential building at the margin of the image.
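The text does not state how the CAMs were computed. As a minimal sketch, assuming the original CAM formulation (the last convolutional feature maps are globally average-pooled and fed to a dense output layer), the map for one class is the weighted sum of the feature maps, using that class's dense-layer weights. The feature maps and weights below are tiny invented values, and the function names are illustrative.

```python
# Sketch of a Class Activation Map under the assumption of a
# global-average-pooling + dense output head. All numbers are toy values.

def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) for one class.

    feature_maps:  list of K two-dimensional lists
    class_weights: list of K dense-layer weights belonging to the class
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for r in range(height):
            for c in range(width):
                cam[r][c] += weight * fmap[r][c]
    return cam


def normalize_cam(cam):
    """Scale the map to [0, 1] so it can be overlaid on the image as a heat map."""
    flat = [v for row in cam for v in row]
    lowest, highest = min(flat), max(flat)
    span = highest - lowest
    if span == 0:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lowest) / span for v in row] for row in cam]


# Two hypothetical 2x2 feature maps and the weights of one class
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
class_weights = [0.5, 1.0]

cam = class_activation_map(feature_maps, class_weights)
heat = normalize_cam(cam)
```

In practice the low-resolution map would be upsampled to the input image size before overlaying it; that step is omitted here.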

Apart from these misclassifications, some predictions contribute to the classification error only because the prediction and the ground-truth label focus on different objects. From a human point of view, the building covering most of the image is decisive for labeling, whereas a CNN attends only to the features with the most impact on its prediction, independent of their location and size within the image. Figure 3c shows the CAM of such a "false" prediction. These predictions are not necessarily wrong, but they still lower the overall accuracy whenever they do not match the human-given label. Consequently, evaluating the classification of building facades as single-label image classification underestimates the real capability of the networks. We therefore strongly recommend performing multilabel classification, or object detection and object classification, in future work.
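The difference between the single-label setup and the recommended multilabel setup can be sketched as follows. This is an illustration under assumptions, not the method of the thesis: it assumes a softmax output for single-label classification and independent sigmoid outputs with a 0.5 threshold for multilabel classification, and the logits are invented.

```python
import math

# Sketch: single-label (softmax, argmax) versus multilabel (independent
# sigmoids, threshold) decisions on the same hypothetical network outputs.

CLASSES = ["commercial", "hybrid", "residential", "specialUse", "underConstruction"]


def softmax(logits):
    """Numerically stable softmax: probabilities that sum to one."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Independent per-class probability."""
    return 1.0 / (1.0 + math.exp(-z))


def single_label(logits, classes):
    """Exactly one class wins, even if several score highly."""
    probs = softmax(logits)
    return classes[probs.index(max(probs))]


def multi_label(logits, classes, threshold=0.5):
    """Every class whose sigmoid score reaches the threshold is reported."""
    return [name for name, z in zip(classes, logits) if sigmoid(z) >= threshold]


# Invented logits for an image in which two classes score highly
logits = [-2.0, -1.5, 1.2, 1.0, -3.0]

single = single_label(logits, CLASSES)
multi = multi_label(logits, CLASSES)
print(single)
print(multi)
```

In the single-label case, the second high-scoring class is discarded and the prediction is counted as an error if the human label names that class; the multilabel decision keeps both.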
