Here you can find the code used for the training and experimental evaluation of the approach described in the paper "Unsupervised Domain Adaptation in Semantic Segmentation via Orthogonal and Clustered Embeddings".


Deep learning frameworks have allowed for a remarkable advancement in semantic segmentation, but the data-hungry nature of convolutional networks has rapidly raised the demand for adaptation techniques able to transfer learned knowledge from label-abundant domains to unlabeled ones. In this paper we propose an effective Unsupervised Domain Adaptation (UDA) strategy, based on a feature clustering method that captures the different semantic modes of the feature distribution and groups features of the same class into tight and well-separated clusters. Furthermore, we introduce two novel learning objectives to enhance the discriminative clustering performance: an orthogonality loss forces spaced-out individual representations to be orthogonal, while a sparsity loss reduces the number of active feature channels class-wise. The joint effect of these modules is to regularize the structure of the feature space. Extensive evaluations in the synthetic-to-real scenario show that we achieve state-of-the-art performance.
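To give an intuition for the two auxiliary objectives, the following is a minimal NumPy sketch, not the authors' implementation: it assumes per-class feature representatives (here called `prototypes`, a hypothetical name) and reduces each loss to its geometric essence, penalizing non-orthogonality between classes and rewarding channel sparsity.

```python
import numpy as np

def orthogonality_loss(prototypes):
    """Penalize non-orthogonality between class prototypes.

    prototypes: (C, D) array, one representative feature per class.
    Returns the mean squared off-diagonal cosine similarity, which is
    zero when all class representations are mutually orthogonal.
    """
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = p @ p.T                              # (C, C) cosine similarities
    c = p.shape[0]
    off_diag = sim - np.diag(np.diag(sim))     # zero out the diagonal
    return (off_diag ** 2).sum() / (c * (c - 1))

def sparsity_loss(prototypes):
    """A simple sparsity proxy: mean absolute value of the L2-normalized
    prototypes, which is lower when few feature channels are active."""
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return np.abs(p).mean()
```

In the paper these terms act on the encoder's latent features during adaptation; this sketch only illustrates the quantities being regularized.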


The code can be found here.


The method is illustrated in Figure 1.

Fig. 1: Overview of the proposed approach. Features after supervised training on the source domain are shown in light gray, while features of the current step are colored. A set of techniques is employed to better shape the latent feature space spanned by the encoder. Features are clustered and the clusters are forced to be disjoint. At the same time, features belonging to different classes are forced to be orthogonal to each other. Additionally, features are forced to be sparse, and an entropy minimization loss can also be added to push target samples away from the decision boundaries.
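The clustering and entropy-minimization terms mentioned in the caption can be sketched as follows. This is a simplified illustration, not the paper's exact formulation: it assumes flattened per-pixel `features` with integer class `labels` (source side) and softmax class probabilities `probs` (target side), all hypothetical names.

```python
import numpy as np

def clustering_loss(features, labels):
    """Pull each feature toward its class centroid, producing tight clusters.

    features: (N, D) array of per-pixel features; labels: (N,) class ids.
    Returns the mean squared distance of features to their class centroid.
    """
    loss, count = 0.0, 0
    for c in np.unique(labels):
        fc = features[labels == c]
        centroid = fc.mean(axis=0)
        loss += ((fc - centroid) ** 2).sum()
        count += fc.shape[0]
    return loss / count

def entropy_min_loss(probs, eps=1e-8):
    """Shannon-entropy minimization over pixel-wise class probabilities.

    probs: (N, C) array of softmax outputs for N target pixels.
    Low entropy corresponds to predictions far from decision boundaries.
    """
    ent = -(probs * np.log(probs + eps)).sum(axis=1)   # per-pixel entropy
    return ent.mean()
```

On the target domain, where no labels are available, the entropy term can be minimized directly, while clustering relies on source labels or pseudo-labels.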



The main quantitative and qualitative results are reported in the following.

Table 1: Numerical evaluation of the GTA5 and SYNTHIA to Cityscapes adaptation scenarios in terms of per-class and mean IoU. Evaluations are performed on the validation set of the Cityscapes dataset. In all the experiments the DeepLab-V2 segmentation network is employed, with VGG-16 (top) or ResNet-101 (bottom) backbones. The mIoU* results in the last column refer to the 13-classes configuration, i.e., classes marked with * are ignored. MaxSquares IW (r) denotes our re-implementation, as original results are provided only for the ResNet-101 backbone.

Fig. 2: Semantic segmentation of some sample scenes from the Cityscapes validation dataset when adaptation is performed from the GTA5 source dataset and the DeepLab-V2 with ResNet-101 backbone is employed.



For any information on the method you can contact

Have a look at our website for other works on this topic.




[1] M. Toldo, U. Michieli, P. Zanuttigh, "Unsupervised Domain Adaptation in Semantic Segmentation via Orthogonal and Clustered Embeddings," accepted for publication at the Winter Conference on Applications of Computer Vision (WACV), 2021.

