LEARNING TO JOINTLY SHARE AND PRUNE WEIGHTS FOR GROUNDING BASED VISION AND LANGUAGE MODELS

Abstract

Transformers have been successful in processing different data modalities, such as language and image data, which allows architecturally similar transformers to achieve good performance across modalities. Leveraging this observation, we propose a unified framework that shares weights across two transformer backbones, shares weights within the same backbone, and prunes weights across both backbones. More specifically, we investigate weight sharing and pruning for two components of the transformers: (1) Multi-Head Self-Attention (MSA) and (2) Feed-Forward Network (FFN) layers. To jointly perform weight sharing and pruning, we use a regularization term that aligns model weights with the desired structure during the multimodal pre-training step. The structure vectors for sharing and pruning are generated by a hypernetwork, which can capture complex interactions between pruning and sharing across layers and modalities. We train the hypernetwork and the model weights iteratively so that the learned structure evolves along with the model weights. After minimizing the proposed objective in the pre-training step, we perform weight sharing and pruning and fine-tune the compressed model on downstream tasks. Finally, we perform experiments on vision and language tasks, including Referring Expression Comprehension (REC), Visual Question Answering (VQA), and Object Detection, using the state-of-the-art grounding-based models MDETR and GLIP. Our experiments show that we can compress these models by 35-40% by sharing and pruning MSA and FFN weights with almost no loss in accuracy.

1. INTRODUCTION

The dominant architecture in natural language processing (NLP) is the Transformer (Vaswani et al., 2017). Beyond NLP, recent advances in computer vision show that transformer-based models such as ViT (Dosovitskiy et al., 2021) and DeiT (Touvron et al., 2020) can achieve similar or even better performance than convolutional neural networks (CNNs) on various tasks. As a result, we can use architecturally similar models for cross-modal tasks with vision and language data. This setting naturally lays the foundation for structurally sharing weights across different modalities. The advantage of weight sharing is that it encourages weight reuse, reducing the number of parameters while largely maintaining model capacity. On the other hand, existing weight sharing techniques have limitations. Most of them (Lee et al., 2021; You et al., 2022; Lan et al., 2019; Reid et al., 2021) use manually designed rules to share a whole layer or block, which greatly restricts the flexibility of weight sharing; this reduced flexibility can lead to drastic performance drops. To maximally utilize model parameters, we propose to unify cross-modal sharing, layer-wise sharing, and pruning in a single framework. Unlike previous works, the minimal unit of these operations is a weight vector instead of a whole layer or block, which drastically increases the flexibility of sharing and pruning. Also, to avoid manually designed strategies, the positions of sharing and pruning are learned in an end-to-end differentiable manner. To pursue a better trade-off between model performance and parameter efficiency, we aim to maximize flexibility by exploiting the structure of the transformer backbones. If only cross-modal sharing is considered, the compression rate has an upper bound of about 50%, reached when all layers of one backbone are shared with the other.
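To make the vector-level granularity concrete, the sketch below reuses individual weight vectors (rows of a weight matrix) of one backbone inside the other, guided by a binary sharing mask. This is a minimal NumPy sketch under our own naming (`apply_vector_sharing`, `share_mask` are illustrative, not the paper's exact implementation); in the actual framework the mask is produced by a learned hypernetwork rather than fixed by hand.

```python
import numpy as np

def apply_vector_sharing(W_vision, W_text, share_mask):
    """Share weights at row (weight-vector) granularity across modalities.

    Rows where share_mask == 1 reuse the vision backbone's weight vectors
    in the text backbone; rows where share_mask == 0 keep their
    modality-specific weights. This is finer-grained than sharing a whole
    layer or block.
    """
    share_mask = share_mask.astype(bool)
    W_text_shared = W_text.copy()
    W_text_shared[share_mask] = W_vision[share_mask]
    return W_text_shared

# Toy example: 4 weight vectors (rows) of dimension 3.
W_v = np.arange(12, dtype=float).reshape(4, 3)   # vision backbone weights
W_t = -np.ones((4, 3))                           # text backbone weights
mask = np.array([1, 0, 1, 0])                    # share rows 0 and 2
W_shared = apply_vector_sharing(W_v, W_t, mask)
```

After sharing, only the unshared rows of the text backbone need separate storage, so the saved parameter count grows linearly with the number of shared vectors rather than jumping in whole-layer increments.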
Another direction is to share layers within a single backbone, similar to ALBERT (Lan et al., 2019); however, the downside is that there is no prior success of cross-layer sharing for vision transformers. In addition to weight sharing, pruning is also an option; however, Li et al. (2020); Yu et al. (2022) show that it limits the capacity of the model when the compression rate is high. We argue that, especially in multimodal tasks, it can be hard to achieve a high compression rate while maintaining high accuracy using only pruning or only sharing. We therefore unify cross-modal sharing, layer-wise sharing, and pruning into a single framework to maximize the flexibility of sharing and pruning. With this increased flexibility, finding which weights to prune or share becomes very difficult. To solve this problem, we design a hypernetwork that efficiently learns the binary structure vectors for sharing and pruning. The hypernetwork can effectively capture the complex interactions between pruning and sharing across different layers and modalities. Since we cannot share and prune a weight vector at the same time, we apply constraints on different parts of the structure vectors. Instead of directly applying the structure vectors to share and prune, we use a regularization term that softly aligns the learned structures with the backbone weights; the benefit is that we do not need to append a separate fine-tuning process to the whole pre-training step. Finally, we train the hypernetwork and the model iteratively so that the learned structures can adapt to the changes in the model during training. With these novel designs, our method learns suitable structures for sharing and pruning at a small extra cost. We evaluate the effectiveness of our method with two state-of-the-art vision and language grounding models: MDETR (Kamath et al., 2021) and GLIP (Li et al., 2022).
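The soft alignment regularizer described above can be sketched as follows: rows marked for pruning are pushed toward zero, rows marked for cross-modal sharing are pushed toward agreement between the two backbones, and the two masks are constrained to be disjoint since a vector cannot be pruned and shared simultaneously. This is a simplified NumPy sketch with hypothetical names (`structure_regularizer`, `lam`); the paper's full objective also covers layer-wise sharing and is optimized jointly with the pre-training loss.

```python
import numpy as np

def structure_regularizer(W_vision, W_text, prune_mask, share_mask, lam=1.0):
    """Soft alignment penalty between backbone weights and learned structure.

    prune_mask[i] == 1: row i of the text backbone should shrink toward zero.
    share_mask[i] == 1: row i of the two backbones should agree (sharing).
    A row cannot be both pruned and shared, so the masks must be disjoint.
    """
    assert not np.any(prune_mask * share_mask), "masks must be disjoint"
    # Penalize the squared norm of rows selected for pruning.
    prune_term = np.sum((prune_mask[:, None] * W_text) ** 2)
    # Penalize disagreement between backbones on rows selected for sharing.
    share_term = np.sum((share_mask[:, None] * (W_vision - W_text)) ** 2)
    return lam * (prune_term + share_term)

# Toy example: 2 weight vectors of dimension 2.
W_t = np.array([[1.0, 1.0], [0.0, 0.0]])
W_v = np.array([[0.0, 0.0], [2.0, 2.0]])
prune_mask = np.array([1.0, 0.0])  # prune row 0
share_mask = np.array([0.0, 1.0])  # share row 1
reg = structure_regularizer(W_v, W_t, prune_mask, share_mask)
```

Because the penalty only nudges weights toward the structure during pre-training, the hard sharing/pruning step afterwards removes or merges vectors that are already near their target values, which is why no separate recovery fine-tuning stage is needed before the downstream fine-tuning.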
We use the CLIP text transformer (Radford et al., 2021) as the text backbone and DeiT (Touvron et al., 2020) as the vision backbone, so the two backbones have similar architectures. Our method is evaluated on several downstream tasks: RefCOCO/RefCOCO+ (Yu et al., 2016)/RefCOCOg (Mao et al., 2016) for referring expression comprehension (REC), GQA (Hudson & Manning, 2019) for visual question answering (VQA), MS-COCO object detection (Lin et al., 2014), and Flickr-30k (Plummer et al., 2015) for phrase grounding. On these benchmarks, our method can remove around 40-45% of the backbone parameters and 35-40% of all parameters with almost no accuracy drop; on some tasks, it even outperforms the original model. These results show that the proposed framework achieves a favorable trade-off between the number of parameters and accuracy.

2. RELATED WORKS

Vision and Language Models. Existing vision and language models can be broadly grouped into two categories: (1) two-stage and (2) single-stage. Two-stage methods (Yu et al., 2018; Chen et al., 2020; Lu et al., 2019) utilize off-the-shelf object detectors, e.g., Faster R-CNN, to detect objects and represent them with convolutional features. For referring expression comprehension, the referring expression is matched to its appropriate region (Yu et al., 2018). For other vision and language tasks, region descriptions are passed through two transformers to obtain image and language representations (Lu et al., 2019; Zhang et al., 2021a; Gan et al., 2020). Single-stage methods (Kamath et al., 2021; Deng et al., 2021; Yang et al., 2020; Chen et al., 2018; Li & Sigal, 2021) avoid a detached off-the-shelf object detector and perform end-to-end training and inference, reducing the computational complexity of the two-stage methods. Another advantage of single-stage methods is that the full model can be pre-trained for text-conditioned object detection, as in MDETR (Kamath et al., 2021), GLIP (Li et al., 2022), GLIP-v2 (Zhang et al., 2022), etc.

Pruning with Transformers. With the increasing popularity of vision transformers, many works prune different components, such as tokens (Kong et al., 2021; Xu et al., 2021; Rao et al., 2021), structures or weights (Chen et al., 2021; Yin et al., 2023; Lou et al., 2022b), or all of the above (Yang et al., 2021). Pruning for language transformers (Li et al., 2020; Gale et al., 2019) can be traced back even earlier. Unlike these works, in the cross-modal setting, we only prune weights that are not useful in either modality.

Transformer-based Unified Backbones. Transformers have recently enjoyed tremendous popularity in processing different modalities, including images, videos (Girdhar et al., 2019), speech (Dong et al., 2018), audio (Jaegle et al., 2021), and language (Jaegle et al., 2021; Hu & Singh, 2021; Zhang et al., 2021b). The universality of transformers has been exploited for multi-tasking using the same vision backbone for different vision-only tasks (Girdhar et al., 2022), the same vision and language

