CAFENET: CLASS-AGNOSTIC FEW-SHOT EDGE DETECTION NETWORK

Abstract

We tackle a novel few-shot learning challenge, few-shot semantic edge detection, aiming to localize boundaries of novel categories using only a few labeled samples. Reliable boundary information has been shown to boost the performance of semantic segmentation and localization, while also playing a key role in its own right in object reconstruction, image generation and medical imaging. Few-shot semantic edge detection allows recovery of accurate boundaries with just a few examples. In this work, we present a Class-Agnostic Few-shot Edge detection Network (CAFENet) based on a meta-learning strategy. CAFENet employs a small-scale semantic segmentation module to compensate for the lack of semantic information in edge labels. The predicted segmentation mask is used to generate an attention map that highlights the target object region and makes the decoder module concentrate on that region. We also propose a new regularization method based on multi-split matching: in meta-training, the metric-learning problem with high-dimensional vectors is divided into smaller subproblems with low-dimensional sub-vectors. Since there are no existing datasets for few-shot semantic edge detection, we construct two new datasets, FSE-1000 and SBD-5i, and evaluate the performance of the proposed CAFENet on them. Extensive simulation results confirm that the proposed CAFENet achieves better performance than baseline methods based on fine-tuning or few-shot segmentation.

1. INTRODUCTION

Semantic edge detection aims to identify pixels that belong to boundaries of predefined categories. Boundary information has been shown to be effective for boosting the performance of semantic segmentation (Bertasius et al., 2016; Chen et al., 2016) and localization (Yu et al., 2018a; Wang et al., 2015). It also plays a key role in applications such as object reconstruction (Ferrari et al., 2007; Zhu et al., 2018), image generation (Isola et al., 2017; Wang et al., 2018) and medical imaging (Abbass & Mousa, 2017; Mehena, 2019). Early edge detection algorithms interpret the problem as a low-level grouping problem exploiting hand-crafted features and local information (Canny, 1986; Sugihara, 1986). Recently, there have been significant improvements in edge detection thanks to advances in deep learning. Moreover, going beyond plain boundary detection, category-aware semantic edge detection became possible (Acuna et al., 2019; Hu et al., 2019; Yu et al., 2018b). However, it is impossible to train deep neural networks without massive amounts of annotated data. To overcome the data scarcity issue in image classification, few-shot learning has been actively studied in recent years (Finn et al., 2017; Lifchitz et al., 2019). Few-shot learning algorithms train machines to learn previously unseen classification tasks using only a few relevant labeled examples. More recently, the idea of few-shot learning has been applied to computer vision tasks requiring highly laborious and expensive data labeling, such as semantic segmentation (Dong & Xing, 2018; Wang et al., 2019) and object detection (Fu et al., 2019; Karlinsky et al., 2019). Based on meta-learning across varying tasks, machines can adapt to unencountered environments and demonstrate robust performance in various computer vision problems. In this paper, we consider a novel few-shot learning challenge, few-shot semantic edge detection, to detect semantic boundaries using only a few labeled samples.
Through experiments, we show that few-shot semantic edge detection cannot be solved simply by fine-tuning a pretrained semantic edge detector or by utilizing a non-parametric edge detector in a few-shot segmentation setting. To tackle this elusive challenge, we propose a class-agnostic few-shot edge detector (CAFENet) and present new datasets for evaluating few-shot semantic edge detection. Fig. 1 shows the architecture of the proposed CAFENet. Since edge labels do not contain enough semantic information due to their sparsity, the performance of an edge detector severely degrades when the training dataset is very small. To overcome this, we perform segmentation in advance of edge detection, using downsized features and segmentation labels generated from the boundary labels. We utilize a simple metric-based segmentator generating a segmentation mask through pixel-wise feature matching with class prototypes, which are computed by the masked average pooling of (Zhang et al., 2018). The predicted segmentation mask provides semantic information to the edge detector. Multi-scale attention maps are generated from the segmentation mask and applied to the corresponding multi-scale features. The edge detector predicts the semantic boundaries using the attended features. Using this attention mechanism, the edge detector can focus on relevant regions while alleviating the noise effect of external details.

2. RELATED WORK

2.1. FEW-SHOT LEARNING

To tackle the few-shot learning challenge, many methods have been proposed based on meta-learning. Optimization-based methods (Finn et al., 2017; Ravi & Larochelle, 2016) train a meta-learner which updates the parameters of the actual learner so that the learner can easily adapt to a new task with only a few labeled samples. Metric-based methods (Vinyals et al., 2016; Snell et al., 2017; Yoon et al., 2019) train the feature extractor to assemble features from the same class together in the embedding space while keeping features from different classes far apart. Recent metric-based approaches propose dense classification (Hou et al., 2019; Kye et al., 2020), which trains an instance-wise classifier with a pixel-wise classification loss that imposes coherent predictions over the spatial dimensions and, as a result, prevents overfitting. Our model adopts the metric-based method for few-shot learning. Inspired by dense classification, we propose multi-split matching regularization, which divides the feature vector into sub-vector splits and performs split-wise classification for regularization in meta-learning.

2.2. FEW-SHOT SEMANTIC SEGMENTATION

The goal of few-shot segmentation is to perform semantic segmentation with a few labeled samples based on meta-learning (Shaban et al., 2017; Dong & Xing, 2018; Wang et al., 2019). OSLSM of (Shaban et al., 2017) adopts a two-branch structure: a conditioning branch generating element-wise scale and shift factors using the support set, and a segmentation branch performing segmentation with a fully convolutional network and task-conditioned features. Co-FCN (Rakelly et al., 2018) also utilizes a two-branch structure: a globally pooled prediction is generated from the support set in the conditioning branch and fused with query features to predict the mask in the segmentation branch. SG-One of (Zhang et al., 2018) proposes masked average pooling to compute prototypes from the pixels of support features. Cosine similarity scores are computed between the prototypes and the pixels of the query feature, and the resulting similarity map guides the segmentation process. CANet of (Zhang et al., 2019) also adopts masked average pooling to generate a global feature vector, and concatenates it with every location of the query feature for dense comparison in predicting the segmentation mask. PANet of (Wang et al., 2019) introduces prototype alignment for regularization, predicting the segmentation masks of support samples using query prediction results as labels of query samples. PMM of (Yang et al., 2020) utilizes multiple prototypes generated using the Expectation-Maximization (EM) algorithm to effectively leverage the semantic information in the few labeled samples.

2.3. SEMANTIC EDGE DETECTION

Semantic edge detection aims to find the boundaries of objects in an image and classify the objects at the same time. The history of semantic edge detection (Acuna et al., 2019; Hu et al., 2019) dates back to the work of (Prasad et al., 2006), which adopts a support vector machine as a semantic classifier on top of the traditional Canny edge detector. Recently, many semantic edge detection algorithms rely on deep neural networks and multi-scale feature fusion. CASENet of (Yu et al., 2017) addresses semantic edge detection as a multi-label problem where each boundary pixel is labeled with the categories of adjacent objects. Dynamic Feature Fusion (DFF) of (Hu et al., 2019) proposes a novel way to leverage multi-scale features: the features are fused by weighted summation, with fusion weights generated dynamically for each image and each pixel. Meanwhile, Simultaneous Edge Alignment and Learning (SEAL) of (Yu et al., 2018b) deals with the severe annotation noise of the existing edge dataset (Hariharan et al., 2011). SEAL treats edge labels as latent variables and jointly trains them to align noisy, misaligned boundary annotations. Semantically Thinned Edge Alignment Learning (STEAL) of (Acuna et al., 2019) improves the computational efficiency of edge label alignment through a lightweight level-set formulation.

4. METHOD

We propose a novel algorithm for few-shot semantic edge detection. Fig. 2 illustrates the network architecture of the proposed method. The proposed CAFENet adopts a semantic segmentation module to compensate for the lack of semantic information in edge labels. The predicted segmentation mask is utilized for attention in the skip connections. The final edge detection is done using attentive multi-scale features.

4.1. SEMANTIC SEGMENTATOR

Most previous works on semantic edge detection directly predict edges from the given input image. However, direct edge prediction is a hard task when only a few labeled samples are given. To overcome this difficulty in few-shot edge detection, we adopt a semantic segmentation module in advance of edge prediction. With the assistance of the segmentation module, CAFENet can effectively localize the target object and extract semantic features from query samples. For few-shot segmentation, we employ metric learning which utilizes prototypes for foreground and background, as done in (Dong & Xing, 2018; Wang et al., 2019). Given the support set $S = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$, the encoder E extracts features $\{E(x_i^s)\}_{i=1}^{N_s}$ from S. Also, for the support labels $\{y_i^s\}_{i=1}^{N_s}$, we generate dense segmentation masks $\{M_i^s\}_{i=1}^{N_s}$ using a rule-based preprocessor, considering the pixels inside the boundary as foreground pixels in the segmentation label.
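The rule-based preprocessor can be realized with a simple flood fill. The sketch below is an illustration, not the authors' released code: every pixel reachable from the image border without crossing a boundary pixel is marked as background, and the remaining pixels, including the boundary itself, are treated as foreground.

```python
import numpy as np
from collections import deque

def boundary_to_mask(edge):
    """Convert a binary boundary label (HxW, 1 = edge) into a dense
    segmentation mask by flood-filling the background from the border."""
    h, w = edge.shape
    outside = np.zeros((h, w), dtype=bool)
    # Seed the fill with every non-edge pixel on the image border.
    q = deque((r, c) for r in range(h) for c in range(w)
              if (r in (0, h - 1) or c in (0, w - 1)) and edge[r, c] == 0)
    for r, c in q:
        outside[r, c] = True
    # 4-connected flood fill that never crosses boundary pixels.
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not outside[nr, nc] and edge[nr, nc] == 0:
                outside[nr, nc] = True
                q.append((nr, nc))
    # Everything not reached from the border (interior + boundary) is foreground.
    return (~outside).astype(np.uint8)
```
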
Using the down-sampled segmentation labels $\{m_i^s\}_{i=1}^{N_s}$, the prototype for foreground pixels $P_{FG}$ is computed as
$$P_{FG} = \frac{1}{N_s}\frac{1}{H \times W}\sum_i \sum_j E_j(x_i^s)\, m_{i,j}^s$$
where $j$ indexes the pixel location, and $E_j(x)$ and $m_{i,j}^s$ denote the $j$th pixel of the feature $E(x)$ and of the segmentation mask $m_i^s$. $H$ and $W$ denote the height and width of the images. Likewise, the background prototype $P_{BG}$ is computed as
$$P_{BG} = \frac{1}{N_s}\frac{1}{H \times W}\sum_i \sum_j E_j(x_i^s)\,(1 - m_{i,j}^s).$$
Following the prototypical networks of (Snell et al., 2017), the probability that pixel $j$ belongs to the foreground for query sample $x_i^q$ is
$$p(y_{i,j}^q = FG \mid x_i^q; E) = \frac{\exp(-\tau\, d(E_j(x_i^q), P_{FG}))}{\exp(-\tau\, d(E_j(x_i^q), P_{FG})) + \exp(-\tau\, d(E_j(x_i^q), P_{BG}))}$$
where $d(\cdot,\cdot)$ is the squared Euclidean distance between two vectors and $\tau$ is a learnable temperature parameter. With query samples $\{x_i^q\}_{i=1}^{N_q}$ and the down-sampled segmentation labels for the query $\{m_i^q\}_{i=1}^{N_q}$, the segmentation loss $L_{Seg}$ is calculated as the mean-squared error (MSE) between the predicted probabilities and the down-sized segmentation mask:
$$L_{Seg} = \frac{1}{N_q}\frac{1}{H \times W}\sum_{i=1}^{N_q}\sum_{j=1}^{H \times W} \big( p(y_{i,j}^q = FG \mid x_i^q; E) - m_{i,j}^q \big)^2.$$
Note that the segmentation mask is generated at a down-sized scale, so that any pixel near the boundaries can be classified into the foreground to some extent, as well as the background. Therefore, we approach the problem as regression using the MSE loss rather than the cross-entropy loss.
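The three steps above can be sketched in NumPy, assuming features and masks have already been extracted and down-sampled to a common size (the function names are ours):

```python
import numpy as np

def prototypes(feats, masks):
    """Masked average pooling. feats: [N, C, H, W] support features,
    masks: [N, H, W] binary down-sampled segmentation labels.
    Normalized by N*H*W as in the paper's formula."""
    n, c, h, w = feats.shape
    m = masks[:, None]                                     # [N, 1, H, W]
    p_fg = (feats * m).sum(axis=(0, 2, 3)) / (n * h * w)
    p_bg = (feats * (1 - m)).sum(axis=(0, 2, 3)) / (n * h * w)
    return p_fg, p_bg

def fg_probability(q_feat, p_fg, p_bg, tau=1.0):
    """Softmax over negative squared Euclidean distances to the prototypes.
    q_feat: [C, H, W]; returns a [H, W] foreground-probability map."""
    d_fg = ((q_feat - p_fg[:, None, None]) ** 2).sum(axis=0)
    d_bg = ((q_feat - p_bg[:, None, None]) ** 2).sum(axis=0)
    e_fg, e_bg = np.exp(-tau * d_fg), np.exp(-tau * d_bg)
    return e_fg / (e_fg + e_bg)

def seg_loss(probs, mask):
    """MSE between predicted probabilities and the down-sized mask."""
    return np.mean((probs - mask) ** 2)
```
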

4.2. MULTI-SPLIT MATCHING REGULARIZATION

The metric-based few-shot segmentation method utilizes distance metrics between high-dimensional feature vectors and prototypes, as seen in Fig. 3a. However, this approach is prone to overfitting due to the massive number of parameters in the feature vectors. To get around this issue, we propose a novel regularization method, multi-split matching regularization (MSMR). In MSMR, high-dimensional feature vectors are split into several low-dimensional feature vectors, and metric learning is conducted on each vector split, as in Fig. 3b. With the query feature $E(x_i^q) \in \mathbb{R}^{C \times W \times H}$, where $C$ is the channel dimension and $H$, $W$ are the spatial dimensions, we divide $E(x_i^q)$ into $K$ sub-vectors $\{E^k(x_i^q)\}_{k=1}^{K}$ along the channel dimension. Each sub-vector $E^k(x_i^q)$ lies in $\mathbb{R}^{\frac{C}{K} \times W \times H}$. Likewise, the prototypes $P_{FG}$ and $P_{BG}$ are disassembled into $K$ sub-vectors $\{P_{FG}^k\}_{k=1}^{K}$ and $\{P_{BG}^k\}_{k=1}^{K}$ along the channel dimension, where $P_{FG}^k, P_{BG}^k \in \mathbb{R}^{\frac{C}{K}}$. For the $k$th sub-vector of the query feature, $E^k(x_i^q)$, the probability that the $j$th pixel belongs to the foreground class is computed as:
$$p^k(y_{i,j}^q = FG \mid x_i^q; E) = \frac{\exp(-\tau\, d(E_j^k(x_i^q), P_{FG}^k))}{\exp(-\tau\, d(E_j^k(x_i^q), P_{FG}^k)) + \exp(-\tau\, d(E_j^k(x_i^q), P_{BG}^k))}. \quad (5)$$
MSMR divides the original metric-learning problem into $K$ small sub-problems, each with fewer parameters, and acts as a regularizer for high-dimensional embeddings. The prediction results of the $K$ sub-problems are reflected in learning by adding the split-wise segmentation losses to the original segmentation loss in Eq. 4. The total segmentation loss is calculated as
$$L_{Seg} = \frac{1}{N_q}\frac{1}{H \times W}\sum_{i=1}^{N_q}\sum_{j=1}^{H \times W} \Big\{ (p_{i,j} - m_{i,j}^q)^2 + \sum_{k=1}^{K} (p_{i,j}^k - m_{i,j}^q)^2 \Big\} \quad (6)$$
where $p_{i,j} = p(y_{i,j}^q = FG \mid x_i^q; E)$ and $p_{i,j}^k = p^k(y_{i,j}^q = FG \mid x_i^q; E)$.
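MSMR can be sketched as follows, reusing the prototype-matching step on channel-wise sub-vectors (a NumPy illustration under our own naming, not the authors' implementation):

```python
import numpy as np

def msmr_probs(q_feat, p_fg, p_bg, k, tau=1.0):
    """Split-wise foreground probabilities. q_feat: [C, H, W];
    p_fg, p_bg: [C]. The channel dimension is split into k contiguous
    sub-vectors and an independent two-way matching problem is solved
    for each split."""
    out = []
    for qs, fs, bs in zip(np.array_split(q_feat, k, axis=0),
                          np.array_split(p_fg, k),
                          np.array_split(p_bg, k)):
        d_fg = ((qs - fs[:, None, None]) ** 2).sum(axis=0)
        d_bg = ((qs - bs[:, None, None]) ** 2).sum(axis=0)
        e_fg, e_bg = np.exp(-tau * d_fg), np.exp(-tau * d_bg)
        out.append(e_fg / (e_fg + e_bg))
    return out  # k probability maps of shape [H, W]

def msmr_loss(full_prob, split_probs, mask):
    """Total segmentation loss: full-vector MSE plus the k split-wise
    MSE terms, as in Eq. 6 (single-query form)."""
    loss = np.mean((full_prob - mask) ** 2)
    loss += sum(np.mean((p - mask) ** 2) for p in split_probs)
    return loss
```
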

4.3. ATTENTIVE EDGE DETECTOR

As shown in Fig. 2, we adopt a nested encoder structure to extract rich hierarchical features. The multi-scale side outputs from encoder blocks E^(1) ∼ E^(4) are post-processed through bottleneck blocks S^(1) ∼ S^(4). Since ResNet-34 gives side outputs at down-sized scales, we pass the original image through bottleneck block S^(0) to extract local details at the original scale. In front of S^(3), we employ the Atrous Spatial Pyramid Pooling (ASPP) block of (Chen et al., 2017); we have empirically found that locating the ASPP block there shows better performance. In utilizing the multi-scale features, we employ the predicted segmentation mask $\hat{M}$ from the segmentator, where the $j$th pixel of $\hat{M}$ is the predicted probability from Eq. 3. Note that we generate $\hat{M}$ based on the entire feature vectors and prototypes instead of utilizing sub-vectors, since the split-wise metric learning is used only for regularizing the segmentation module. For each layer $l$, $\hat{M}^{(l)}$ denotes the segmentation mask upscaled to the corresponding feature size by bilinear interpolation. Using the segmentation prediction mask $\hat{M}^{(l)}$, we generate the attention map $A^{(l)}$ as follows. First, predictions with values lower than a threshold $\lambda$ are rounded down to zero, to ignore activations in regions of low confidence. Second, we broaden the attention map using the morphological dilation of (Feng et al., 2019) as a second chance, since the segmentation module may not always guarantee fine results. The final attention map of the $l$th layer, $A^{(l)}$, is computed as
$$A^{(l)} = \mathbb{1}(\hat{M}^{(l)} > \lambda) \odot \hat{M}^{(l)} + \mathrm{Dilation}\big(\mathbb{1}(\hat{M}^{(l)} > \lambda) \odot \hat{M}^{(l)}\big)$$
where $\mathbb{1}(\hat{M}^{(l)} > \lambda) \odot \hat{M}^{(l)}$ is the thresholded prediction mask. The attention maps are applied to the multi-scale features of the corresponding bottleneck blocks S^(0) ∼ S^(4). We apply the residual attention of (Hou et al., 2019), where the initial multi-level side outputs from S^(l) are pixel-wise weighted by $1 + A^{(l)}$, to strengthen the activation values of the semantically important regions.
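The attention-map construction can be sketched as below; the threshold value and the 3 × 3 dilation window are assumptions for illustration, not values taken from the paper:

```python
import numpy as np

def attention_map(seg_mask, lam=0.5, dilate=1):
    """A = thresholded mask + its morphological dilation.
    seg_mask: [H, W] foreground probabilities."""
    gated = np.where(seg_mask > lam, seg_mask, 0.0)   # zero out low confidence
    dilated = gated.copy()
    for _ in range(dilate):
        # Grey-scale dilation: max over a 3x3 neighbourhood.
        padded = np.pad(dilated, 1)
        dilated = np.max([padded[i:i + gated.shape[0], j:j + gated.shape[1]]
                          for i in range(3) for j in range(3)], axis=0)
    return gated + dilated

def residual_attention(side_out, att):
    """Residual attention of (Hou et al., 2019): features are
    pixel-wise weighted by (1 + A)."""
    return side_out * (1.0 + att)
```
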
We visualize the effect of semantic attention in Fig. 4. As shown in Fig. 2, the decoder network is composed of five consecutive convolutional blocks. Each decoder block D^(l) contains three 3 × 3 convolution layers. The outputs of decoder blocks D^(1) ∼ D^(4) are bilinearly upsampled by a factor of two and passed to the next block. Similar to (Feng et al., 2019), the up-sampled decoder outputs are concatenated with the skip-connection features from the bottleneck blocks S^(0) ∼ S^(4) and the previous decoder blocks. Multi-scale semantic information and local details are transmitted through the skip architectures. The hierarchical decoder network in turn refines the outputs of the previous decoder blocks and finally produces the edge prediction $\hat{y}_i^q$ for query sample $x_i^q$. Following the work of (Deng et al., 2018), we combine a cross-entropy loss and a Dice loss to produce crisp boundaries. Given a query set $Q = \{(x_i^q, y_i^q)\}_{i=1}^{N_q}$ and prediction masks $\hat{y}_i^q$, the cross-entropy loss is computed as
$$L_{CE} = -\sum_{i=1}^{N_q} \Big\{ \sum_{j \in Y_+} \log(\hat{y}_{i,j}^q) + \sum_{j \in Y_-} \log(1 - \hat{y}_{i,j}^q) \Big\}$$
where $Y_+$ and $Y_-$ denote the sets of foreground and background pixels. The Dice loss is then computed as
$$L_{Dice} = \sum_{i=1}^{N_q} \frac{\sum_j (\hat{y}_{i,j}^q)^2 + \sum_j (y_{i,j}^q)^2}{2 \sum_j \hat{y}_{i,j}^q\, y_{i,j}^q}$$
where $j$ ranges over the pixels of a label. The final loss for meta-training is given by
$$L_{final} = L_{Seg} + L_{CE} + L_{Dice}.$$
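A per-image sketch of the combined edge loss; the small `eps` for numerical stability is our addition and not part of the paper's formulation:

```python
import numpy as np

def edge_loss(pred, label, eps=1e-6):
    """Cross-entropy plus Dice loss on an edge prediction.
    pred: [H, W] probabilities in (0, 1); label: [H, W] binary edges."""
    # Cross-entropy over foreground (Y+) and background (Y-) pixels.
    ce = -(np.log(pred[label == 1] + eps).sum()
           + np.log(1 - pred[label == 0] + eps).sum())
    # Dice term as in (Deng et al., 2018): (sum p^2 + sum y^2) / (2 sum p*y).
    dice = ((pred ** 2).sum() + (label ** 2).sum()) / (2 * (pred * label).sum() + eps)
    return ce + dice
```
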

5.1.2. FSE-1000

The datasets used in previous semantic edge detection research, such as SBD of (Hariharan et al., 2011) and Cityscapes of (Cordts et al., 2016), are not suitable for few-shot learning as they have only 20 and 30 classes, respectively. We propose a new dataset for few-shot edge detection, which we call FSE-1000, based on FSS-1000 of (Wei et al., 2019). FSS-1000 is a dataset for few-shot segmentation composed of 1000 classes with 10 images per class and foreground-background segmentation annotations. From the images and segmentation masks of FSS-1000, we build FSE-1000 by extracting boundary labels from the segmentation masks. As done for SBD-5i, we extract thick edges whose thickness is around 2-3 pixels on average, in light of the difficulty associated with the few-shot setting. For the dataset split, we divide the 1000 classes into 800 training classes and 200 test classes. We provide the detailed class configuration in the Supplementary Material.
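Extracting a thick edge label from a binary segmentation mask can be sketched as follows; the paper does not specify the exact extraction procedure beyond the target thickness, so this neighbourhood-based rule is one plausible realization:

```python
import numpy as np

def mask_to_thick_edge(mask, thickness=2):
    """Label a pixel as edge if its (2*thickness+1)-sized neighbourhood
    contains both foreground and background pixels of the binary mask."""
    h, w = mask.shape
    edge = np.zeros_like(mask)
    t = thickness
    for r in range(h):
        for c in range(w):
            win = mask[max(0, r - t):r + t + 1, max(0, c - t):c + t + 1]
            if win.min() == 0 and win.max() == 1:   # mixed neighbourhood
                edge[r, c] = 1
    return edge
```
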

5.2. EVALUATION SETTINGS

We use two evaluation metrics to measure the few-shot semantic edge detection performance of our approach: the Average Precision (AP) and the maximum F-measure (MF) at optimal dataset scale (ODS). In evaluation, we compare the unthinned raw prediction results and the ground truths without Non-Maximum Suppression (NMS), following (Acuna et al., 2019; Yu et al., 2018b). For the evaluation of edge detection, an important parameter is the matching distance tolerance, the error threshold between the prediction result and the ground truth. Prior works on edge detection such as (Acuna et al., 2019; Hariharan et al., 2011; Yu et al., 2017; 2018b) adopt a non-zero distance tolerance to resolve annotation noise. However, the proposed datasets for few-shot edge detection utilize thicker boundaries to overcome the annotation noise issue instead of adopting a distance tolerance. Moreover, evaluation with non-zero distance tolerance requires heavy additional computation. This becomes more problematic under the few-shot setting, where performance should be measured on the same test image multiple times due to the variation in the support set. For these reasons, we set the distance tolerance to 0 for both FSE-1000 and SBD-5i. In addition, we count positive predictions from the area inside an object or in the zero-padded region as false positives, which is stricter than the evaluation protocol of prior works (Hariharan et al., 2011; Yu et al., 2017).

5.3. IMPLEMENTATION DETAIL

We implement our framework using the PyTorch library and adopt the Scikit-learn library to construct the precision-recall curve and compute average precision (AP). For the encoder, ResNet-34 pretrained on ImageNet is adopted. All parameters except the encoder parameters are learned from scratch. The entire network is trained using the Adam optimizer of (Kingma & Ba, 2014) with the weight decay regularization of (Loshchilov & Hutter, 2017). In the experiments on both FSE-1000 and SBD-5i, we use a learning rate of $10^{-4}$ and an l2 weight decay rate of $10^{-2}$. For FSE-1000 experiments, the model is trained for 40,000 episodes and the learning rate is decayed by 0.1 after 38,000 episodes. For SBD-5i experiments, 30,000 episodes are used for training, and the learning rate is decayed by 0.1 after 28,000 episodes. Higher-shot training of (Liu et al., 2019) is employed in 1-shot experiments for both datasets.
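The step learning-rate schedule above can be sketched as a small helper, here with the FSE-1000 values; in PyTorch the same effect would come from `torch.optim.AdamW` together with a step scheduler:

```python
def learning_rate(episode, base_lr=1e-4, decay_at=38000, gamma=0.1):
    """Step schedule: the base learning rate is multiplied by `gamma`
    once `decay_at` episodes have been trained."""
    return base_lr * (gamma if episode >= decay_at else 1.0)
```
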

5.3.1. DATA PREPROCESSING

During training, we adopt data augmentation with random rotation by multiples of 90 degrees for both SBD-5 i and FSE-1000. We additionally resize SBD-5 i data to 320×320, while no such resizing is performed on FSE-1000. During evaluation, images of SBD-5 i are zero-padded to 512×512. Again, the original image size is used for FSE-1000.
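A sketch of the preprocessing described above, with the rotation applied identically to an image and its label (the function names are ours; resizing to 320×320 would use a standard image library and is omitted):

```python
import numpy as np

def augment(image, label, rng):
    """Random rotation by a multiple of 90 degrees, applied identically
    to the image and its edge label."""
    k = rng.integers(0, 4)
    return np.rot90(image, k, axes=(0, 1)), np.rot90(label, k, axes=(0, 1))

def zero_pad(image, size=512):
    """Zero-pad an image to size x size (top-left aligned), as used for
    SBD-5i evaluation."""
    h, w = image.shape[:2]
    out = np.zeros((size, size) + image.shape[2:], dtype=image.dtype)
    out[:h, :w] = image
    return out
```
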

5.4. EXPERIMENT RESULT

Table 1 shows the experiment results on the SBD-5i dataset. To verify the value of the proposed method, we compare CAFENet with two baselines. The first baseline is an edge detection model fine-tuned with only a few labeled samples; no meta-learning strategy is used. We employ the DFF of (Hu et al., 2019), utilizing the implementation offered by the authors. For each split of SBD-5i, we pretrain a 15-way semantic edge detector on the training classes and fine-tune the pretrained detector with a few labeled samples for the new classes in the test split. During pretraining, we follow the training strategies and hyperparameters of (Hu et al., 2019). In fine-tuning, we randomly initialize the sub-modules that are closely related to the final prediction ("side5", "side5-w", and "ada-learner") and train them altogether using the support images. The second baseline is constructed by combining a non-parametric edge detector, such as the Canny or Sobel detector, with a few-shot segmentation algorithm. Semantic edge detection is occasionally interpreted as a dual task of semantic segmentation, but prior work (Acuna et al., 2019) verifies that a semantic edge detector outperforms a segmentator combined with the Sobel edge detector, demonstrating the importance of semantic edge detection. In our experiments, we combine PANet and PMM, state-of-the-art few-shot segmentation methods, with the Sobel edge detector, utilizing the implementations provided by the authors. For each split of SBD-5i, we meta-train PANet and PMM on the training classes using the segmentation labels. In evaluation, we obtain the edge predictions for the test classes by applying the Sobel edge detector to the predicted segmentation masks, as done in (Acuna et al., 2019), and compare the predictions with the edge labels of the test classes. We utilize the ResNet-34 backbone for CAFENet and PANet, and the ResNet-50 backbone for PMM.
We also employ higher-shot training in 1-shot experiments for both baselines, as done in the CAFENet experiments. The results in Table 1 show that the proposed CAFENet outperforms all baselines in both MF and AP scores by a significant margin. The experiment results prove that a few-shot semantic edge detector cannot simply be substituted by a few-shot segmentator or a fine-tuned semantic edge detector. In Table 2, the experiment results on the FSE-1000 dataset are shown. For FSE-1000, we only experiment with the few-shot segmentation baseline, since it is hard to train a semantic edge detector on the training set of FSE-1000 due to the large number of training classes. We can see that the proposed CAFENet outperforms the baseline even when the dataset contains more diverse classes. During meta-training of CAFENet, we set the number of query samples in training episodes to 5 for FSE-1000 and 10 for SBD-5i, respectively. In evaluation, we employ the average precision score function of the Scikit-learn library to measure the Average Precision (AP) score. We compute the AP score for each image and average the scores to measure overall performance. For the Maximum F-measure (MF) score, we measure true positives (TP), false positives (FP) and false negatives (FN) at 0.01 threshold intervals for each image, and accumulate the values over all images in 1000 test episodes. The MF score is computed using the accumulated TP, FP, and FN values.
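The accumulated MF computation can be sketched as follows (a simplified illustration; the actual evaluation also accumulates over support-set variations across the 1000 test episodes):

```python
import numpy as np

def max_f_measure(preds, labels, thresholds=np.arange(0.0, 1.0, 0.01)):
    """Maximum F-measure at optimal dataset scale: TP/FP/FN are counted
    at each threshold per image, accumulated over all images, and the
    F-measure is maximized over thresholds at the end."""
    tp = np.zeros(len(thresholds))
    fp = np.zeros(len(thresholds))
    fn = np.zeros(len(thresholds))
    for pred, label in zip(preds, labels):
        for i, t in enumerate(thresholds):
            pos = pred >= t
            tp[i] += np.sum(pos & (label == 1))
            fp[i] += np.sum(pos & (label == 0))
            fn[i] += np.sum(~pos & (label == 1))
    f = 2 * tp / np.maximum(2 * tp + fp + fn, 1)   # F1 = 2TP / (2TP + FP + FN)
    return f.max()
```
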

B ABLATION STUDIES

In this section, we show the results of ablation experiments examining the impact of the proposed MSMR and attentive decoder. Table B.1 shows the experiment results on the FSE-1000 dataset. The baseline method in Table B.1 conducts edge prediction in low resolution and, for meta-training, utilizes the auxiliary loss from the low resolution together with the loss from the edge prediction in the original resolution. The edge prediction is done using a metric-based method utilizing prototypes computed from down-sampled edge labels. The method dubbed Seg utilizes a segmentation module without MSMR or attentive decoding. Seg directly matches high-dimensional query feature vectors with prototypes in both training and evaluation. In Seg, the segmentation module is utilized only to provide the segmentation loss that assists the model in learning to extract semantic features. Seg + Att employs the predicted segmentation mask for the additional attention process in the skip architecture. Seg + MSMR + Att additionally utilizes the MSMR regularization for training. For fair comparison, all methods use the same network architecture and training hyperparameters. For the SBD-5i dataset, the ablation experiments are done with the same model variations as FSE-1000. The results on SBD-5i are shown in Table B.2. Tables B.1 and B.2 demonstrate that the use of the segmentation module in Seg gives significant performance advantages over the baseline for both the FSE-1000 and SBD-5i datasets. It is also seen that the additional use of attentive decoding, Seg + Att, generally improves performance over Seg. Finally, adding the MSMR regularization gives substantial extra gains, as seen in the scores of Seg + MSMR + Att. Clearly, when compared with the baseline, our overall approach Seg + MSMR + Att provides large gains. In the main paper, we report the results of Seg + MSMR + Att as the results of CAFENet.

B.1.1 FEATURE MATCHING METHOD FOR SEGMENTATION

In Table B.3, we compare various feature matching methods between prototypes and query feature vectors for producing the segmentation prediction on SBD-5i. The method baseline refers to the original method generating the segmentation prediction using only the similarity metric between high-dimensional vectors, as done in Eq. 3. For the method average, the segmentation predictions from the low-dimensional feature splits (Eq. 5) and the original high-dimensional feature vectors (Eq. 3) are averaged to generate the final prediction mask. The average method can be understood as utilizing MSMR not only for regularization but also for inference. In the weighted sum method, the above five segmentation masks are combined using a weighted sum with learnable weights. As we can see in Table B.3, the MSMR method shows the best performance when employed only for regularization.

MSMR divides the high-dimensional feature into multiple splits. Table B.4 shows the performance of the proposed CAFENet with varying numbers of splits K. Comparing the K = 1 case with the other cases, we can see that applying MSMR regularization consistently improves performance, with K = 4 giving the best AP and MF performance. The performance gain is marginal when we divide the embedding dimension into pieces that are too small (K = 16) or too big (K = 2). In MSMR, we divide the query feature into K splits along the channel dimension, i.e., we apply deterministic splitting. In Table B.5, we compare the performance of different splitting methods. The method dubbed Deterministic refers to the MSMR method we utilize in CAFENet. The Random method randomly splits feature vectors into 4 parts in each episode. The Baseline method does not split features at all. Interestingly, Random's performance significantly degrades, even below that of Baseline. In the proposed CAFENet, we utilize the semantic attention in the attentive decoder.
In Table B.6, we compare three methods that utilize the attention in different manners. Attentive Decoding is our proposed CAFENet, which applies the semantic attention to the multi-scale features. The second method, Direct Attention, directly passes the features to the edge detector and applies the semantic attention to the final edge prediction. The last method, No Attention, is a baseline where the edge detector generates the prediction without any attention. For both 1-shot and 5-shot settings, regardless of how the attention is applied, semantic attention considerably improves performance. These results show the effectiveness of semantic attention. The results also show that the proposed Attentive Decoding yields better results than Direct Attention.

B.2.2 EXPERIMENTS ON GENERATING THE ATTENTION

In the proposed attentive decoding, we can generate the attention map using various methods. In Table B.7, we compare two different methods of generating the attention map. Segmentation Attention is the method adopted in the proposed CAFENet: the output of the segmentation module $S$ is utilized as $\hat{M}$ in equation 7. In Edge Attention, an edge prediction $E(S)$ is generated from the segmentation mask $S$ by equation 11, following (Feng et al., 2019):
$$E(S) = |S - \mathrm{AvgPool}(S)|.$$
The generated edge prediction $E(S)$ is then used as $\hat{M}$ in equation 7. Experiment results show that utilizing the segmentation mask to generate the attention map performs better than utilizing the edge prediction.

In this section, we report the few-shot semantic edge prediction results with the multi-angle input test. In the multi-angle input test, the model predicts edges by averaging the 4 edge prediction results from 4 copies of an input image rotated by multiples of 90 degrees. We have empirically found that the multi-angle input test significantly improves performance. Tables F.1 and F.2 show the evaluation results with the multi-angle input test for FSE-1000 and SBD-5i, respectively. We can verify the effectiveness of the multi-angle input test from the results.
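The multi-angle input test can be sketched as below, with `model` standing in for the full CAFENet forward pass (a simplified illustration; the real model returns an edge-probability map rather than the input):

```python
import numpy as np

def multi_angle_predict(model, image):
    """Run the model on 4 copies of the image rotated by multiples of
    90 degrees, rotate each prediction back, and average."""
    preds = []
    for k in range(4):
        rotated = np.rot90(image, k, axes=(0, 1))
        pred = model(rotated)
        preds.append(np.rot90(pred, -k, axes=(0, 1)))  # undo the rotation
    return np.mean(preds, axis=0)
```
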



Figure 1: Architecture overview of the proposed CAFENet. The feature extractor, or encoder, extracts features from the image, the segmentator generates a segmentation mask based on metric learning, and the edge detector detects semantic boundaries using the segmentation mask and query features.

For meta-training of CAFENet, we introduce a simple yet powerful regularization method, Multi-Split Matching Regularization (MSMR), which performs metric learning on multiple low-dimensional embedding sub-spaces during meta-training. The main contributions of this paper are as follows. First, we introduce the few-shot semantic edge detection problem of performing semantic edge detection on previously unseen objects using only a few training examples. Second, we introduce two new datasets, SBD-5i and FSE-1000, for few-shot edge detection. Third, we propose a few-shot edge detector, CAFENet, and validate the performance of the proposed method through experiments.

Figure 2: Network architecture overview of the proposed CAFENet. The ResNet-34 encoder E(1)∼E(4) extracts multi-level semantic features. The segmentator module generates a segmentation prediction using the query feature from E(4) and prototypes P_FG and P_BG computed from the support set features. Small bottleneck blocks S(0)∼S(4) transform the original image and the multi-scale features from the encoder blocks to be more suitable for edge detection. Attention maps generated from the segmentation prediction are applied to the multi-scale features to localize the semantically related region. The decoder D(0)∼D(4) takes the attentive multi-scale features and produces the edge prediction.

Figure 3: Comparison between (a) the high-dimensional feature matching used in (Dong & Xing, 2018; Wang et al., 2019) and (b) the split-wise feature matching of MSMR.

Figure 4: An example of an activation map (Yosinski et al., 2015) before and after pixel-wise semantic attention (warmer colors indicate higher values). As shown, the attention mechanism makes the encoder side-outputs attend to the regions of the target object (the horse in the figure).

Based on the SBD dataset (Hariharan et al., 2011) for semantic edge detection, we propose a new SBD-5i dataset. Following the setting of Pascal-5i, the 20 classes of the SBD dataset are divided into 4 splits. In the experiment with split i, the 5 classes in the i-th split are used as the test classes C_test, and the remaining 15 classes as the training classes C_train. The training set D_train is constructed from all image-annotation pairs whose annotations include at least one pixel from the classes in C_train. For each class, boundary pixels that do not belong to that class are treated as background. The test set D_test is constructed in the same way, this time using C_test. Considering the difficulty of the few-shot setting and the severe annotation noise of the SBD dataset, we extract thicker edges: we use edges extracted from the segmentation masks as ground truth instead of the original boundary labels of the SBD dataset, and the extracted edges are 3∼4 pixels thick on average. We conduct 4 experiments, one for each split i = 0∼3, and report the performance of each split as well as the average performance. Note that, unlike Pascal-5i, we do not preserve the division into training and test samples of the original SBD dataset. As a result, images in D_train may also appear in D_test, annotated with classes from C_test.
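The split construction above can be sketched with a small helper. Classes are referred to by index 0–19 in the SBD ordering; the convention that split i simply takes 5 consecutive class indices is our assumption for illustration:

```python
def sbd_split(i, n_classes=20, split_size=5):
    """Return (C_test, C_train) for split i of SBD-5i: the 5 classes of
    split i are held out for testing, the remaining 15 form C_train.
    Classes are class indices 0..n_classes-1 (an illustrative convention)."""
    c_test = list(range(i * split_size, (i + 1) * split_size))
    c_train = [c for c in range(n_classes) if c not in c_test]
    return c_test, c_train
```

For example, split 2 holds out class indices 10–14 and trains on the other 15 classes.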

Figure C.1: Qualitative examples of 5-shot edge detection on SBD-5 i dataset.

Figure C.2: Qualitative examples of 5-shot edge detection on FSE-1000 dataset.

In this work, we address N_c-way N_s-shot semantic edge detection. The goal is to train a model that generalizes to N_c unseen classes given only N_s images and their edge labels per class. Based on the few labeled support samples, the model should produce edge predictions for query images belonging to the N_c unencountered classes. In the N_c-way N_s-shot setting, each training episode is constructed from N_c classes sampled from C_train. Given the N_c categories, N_s support samples and N_q query samples are randomly chosen from D_train for each class. In evaluation, the performance of the model is measured on test episodes, which are constructed in the same way as the training episodes except that the N_c classes and the corresponding support and query samples are drawn from C_test and D_test.
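The episode construction can be sketched as follows. This is a minimal version that assumes the dataset is pre-indexed as a mapping from each class to its list of samples (an assumption made for brevity):

```python
import random

def sample_episode(dataset, classes, n_way, n_shot, n_query, rng=random):
    """Sample one N_c-way N_s-shot episode: pick n_way classes, then
    n_shot support and n_query query samples per class, without overlap.
    `dataset` maps each class id to a list of its samples (assumption)."""
    episode_classes = rng.sample(classes, n_way)
    support, query = [], []
    for c in episode_classes:
        picked = rng.sample(dataset[c], n_shot + n_query)
        support += [(x, c) for x in picked[:n_shot]]
        query += [(x, c) for x in picked[n_shot:]]
    return episode_classes, support, query
```

Test episodes are produced by the same routine with `classes` and `dataset` replaced by C_test and D_test.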

Evaluation results of the proposed CAFENet on SBD-5i. 1000 randomly sampled test episodes are used for evaluation. MF and AP scores are given in %.

Table 1: Ablation experiment results of the proposed CAFENet on FSE-1000. 1000 randomly sampled test episodes are used for evaluation. MF and AP scores are given in %.

Table 2: Ablation experiment results of the proposed CAFENet on SBD-5i. 1000 randomly sampled test episodes are used for evaluation. MF and AP scores are given in %.





Table 6: Comparison of different methods to apply attention on SBD-5i.

In Figures C.1 and C.2, we visualize qualitative results of CAFENet on FSE-1000 and SBD-5i, respectively. The proposed CAFENet successfully detects the edges of the target class in the given images. In Figure C.1, we can compare qualitative edge predictions from the different methods. The 'DFF + Finetune' method succeeds in finding the edges of the objects, but it lacks the ability to learn semantic information and can hardly distinguish the boundaries of the target class from other boundaries. The 'PANet + Sobel' method, on the other hand, successfully captures the semantic information and localizes the target object, but fails to recover accurate boundaries. The proposed CAFENet is capable of both localizing objects of the target class and detecting the correct boundaries. Figure C.3 visualizes more qualitative results on SBD-5i from the ablation experiments. We illustrate and compare the boundary predictions of the baseline, Seg, Seg + Att, and Seg + Att + MSMR methods. For a fair comparison, all methods share the same support set. The results clearly show that the techniques proposed in CAFENet steadily improve the quality of edge prediction.

Table F.1: 1-way 5-shot results with the multi-angle input test on FSE-1000. 1000 randomly sampled test episodes are used for evaluation. MF and AP scores are given in %.

Table F.2: 1-way 5-shot results with the multi-angle input test on SBD-5i. 1000 randomly sampled test episodes are used for evaluation. MF and AP scores are given in %.
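The multi-angle input test reported in these tables can be sketched in a few lines. This is a minimal version assuming the model maps an (H, W) image to an (H, W) edge probability map (the interface is our assumption):

```python
import numpy as np

def multi_angle_predict(model, image):
    """Multi-angle input test: run the model on 4 copies of the image
    rotated by multiples of 90 degrees, rotate each edge prediction back
    to the original orientation, and average the 4 predictions."""
    preds = []
    for k in range(4):
        rotated = np.rot90(image, k)
        pred = model(rotated)
        preds.append(np.rot90(pred, -k))   # undo the k*90-degree rotation
    return np.mean(preds, axis=0)
```

Averaging over the four orientations acts as test-time augmentation, which is consistent with the empirical gains reported in Tables F.1 and F.2.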

6. CONCLUSION

In this paper, we established the few-shot semantic edge detection problem. We proposed the Class-Agnostic Few-shot Edge detector (CAFENet), based on a skip architecture utilizing multi-scale features. To compensate for the shortage of semantic information in edge labels, CAFENet employs a segmentation module at low resolution and utilizes the resulting segmentation masks to generate attention maps. The attention maps are applied to the multi-scale skip connections to localize the semantically related region. We also presented the MSMR regularization method, which splits the feature vectors and prototypes into several low-dimensional sub-vectors and solves multiple metric-learning sub-problems with the sub-vectors. We built two novel datasets, FSE-1000 and SBD-5i, well suited to few-shot semantic edge detection. Experimental results demonstrate that the proposed method significantly outperforms the baseline approaches relying on fine-tuning or few-shot semantic segmentation.

A ADDITIONAL EXPERIMENTAL SETUP

In this section, we provide detailed information about the experimental setup. We adopt as the encoder the ImageNet-pretrained ResNet-34 with 64-128-256-512 channels in its four residual blocks, as provided by the PyTorch framework. To construct the skip architecture, we employ the bottleneck block of ResNet as the post-processing blocks S(1)∼S(4). Each bottleneck block consists of two 1×1 convolutional layers and one 3×3 convolutional layer with an expansion rate of 4. Dropout with ratio 0.25 is applied at the end of each bottleneck block. For the ASPP module in front of S(3), we adopt dilation rates of 1, 4, 7, and 11. The segmentation module generates a segmentation prediction with a rounding threshold λ of 0.4. Each decoder block is composed of three consecutive 3×3 convolutional layers, with dropout of ratio 0.25 again placed at the end of each layer.

The probability that a group G_i belongs to the foreground is calculated as the mean of its pixel values T_i. Groups with probability above the threshold λ are determined to be foreground groups, and the pixels belonging to foreground groups are marked as foreground pixels. We set the threshold λ to 20/255.

E.2 SBD-5i

SBD-5i is constructed based on the existing semantic edge detection dataset SBD. Due to the noise of the boundary annotations in the original SBD, we use thicker edges, as done for FSE-1000. To extract thicker edges, we generate segmentation labels from the edge labels using Algorithm D.2 instead of using the existing segmentation labels of SBD.
Algorithm D.2: Generating a segmentation label M from an edge label y

M, T ← 0_{H,W}                                 (initialize M and T as zero matrices with the same shape as y)
for h = 1, ..., H do                           (H is the height of the image)
    cnt, mode ← 0
    for w = 1, ..., W do                       (W is the width of the image)
        if y_{h,w} = mod(mode + 1, 2) then     (accumulate changes of pixel value)
            cnt ← cnt + 1
            mode ← mod(mode + 1, 2)
    if mod(cnt, 4) = 0 and cnt ≠ 0 then        (check if there are FG pixels in the row)
        cnt', mode ← 0
        for w' = 1, ..., W do                  (find and record the locations of FG pixels in the row)
            if y_{h,w'} = mod(mode + 1, 2) then
                cnt' ← cnt' + 1
                mode ← mod(mode + 1, 2)
            if mod(cnt', 4) = 2 then
                T_{h,w'} ← 1
for w = 1, ..., W do                           (repeat the same process for every column)
    cnt, mode ← 0
    for h = 1, ..., H do
        if y_{h,w} = mod(mode + 1, 2) then
            cnt ← cnt + 1
            mode ← mod(mode + 1, 2)
    if mod(cnt, 4) = 0 and cnt ≠ 0 then
        cnt', mode ← 0
        for h' = 1, ..., H do
            if y_{h',w} = mod(mode + 1, 2) then
                cnt' ← cnt' + 1
                mode ← mod(mode + 1, 2)
            if mod(cnt', 4) = 2 then
                T_{h',w} ← 1
for i = 1, ..., n do
    T_i ← {T_{h,w} | (h,w) ∈ G_i}
    if mean(T_i) ≥ λ then                      (check the probability that G_i belongs to the foreground)
        M_{h,w|(h,w)∈G_i} ← 1                  (1 means a foreground pixel)
    else
        M_{h,w|(h,w)∈G_i} ← 0                  (0 means a background pixel)
return M                                       (return the segmentation annotation)

We then extract thicker edge labels using Algorithm D.1 with a radius value of 4. This process allows us to train the proposed CAFENet using only the edge labels. While all images in FSE-1000 have the same size, the images in SBD-5i have different sizes. However, constructing a training episode as a mini-batch requires images of the same size. Previous works on semantic edge detection typically apply random cropping to deal with this issue. In the few-shot setting, however, random cropping severely degrades the informativeness of the support set and consequently hinders learning. Instead, we resize the training examples to 320 × 320 to retain as much of the image information as possible. When resizing the edge labels for training, we first generate segmentation labels at the original scale using Algorithm D.2 and resize the segmentation labels to 320 × 320.
Then, we extract edge labels from the resized segmentation labels using Algorithm D.1 with a radius value of 3.
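The row/column parity scan at the heart of Algorithm D.2 can be sketched in Python as follows. This is a minimal sketch under our reading of the algorithm: the connected-group voting step is omitted, and a pixel receives a foreground vote when the running count of value changes along its row or column satisfies cnt mod 4 = 2 (i.e., the scan has entered and then left one edge band of a closed, thick contour):

```python
import numpy as np

def scan_interior_1d(line):
    """Parity scan of one row (or column) of a binary edge label.

    cnt counts changes of the pixel value along the line. A line that fully
    crosses closed, thick contours ends with cnt divisible by 4; pixels
    where the running count satisfies cnt % 4 == 2 lie in the interior.
    """
    out = np.zeros(len(line), dtype=np.uint8)
    cnt, mode = 0, 0
    for v in line:                          # first pass: total change count
        if v == (mode + 1) % 2:
            cnt += 1
            mode = (mode + 1) % 2
    if cnt == 0 or cnt % 4 != 0:
        return out                          # line does not fully cross an object
    cnt, mode = 0, 0
    for w, v in enumerate(line):            # second pass: mark interior pixels
        if v == (mode + 1) % 2:
            cnt += 1
            mode = (mode + 1) % 2
        if cnt % 4 == 2:
            out[w] = 1
    return out

def interior_votes(y):
    """Apply the scan to every row and column and OR the votes, giving the
    matrix T that feeds the connected-group voting step of Algorithm D.2."""
    T = np.zeros_like(y, dtype=np.uint8)
    for h in range(y.shape[0]):
        T[h] |= scan_interior_1d(y[h])
    for w in range(y.shape[1]):
        T[:, w] |= scan_interior_1d(y[:, w])
    return T
```

On a square ring of edge pixels, only the pixels enclosed by the ring receive votes; rows or columns that run along an edge band are rejected by the divisibility check in the first pass.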

