LEARNING TO JOINTLY SHARE AND PRUNE WEIGHTS FOR GROUNDING BASED VISION AND LANGUAGE MODELS

Abstract

Transformers have been successful at processing different data modalities, such as language and image data, with architecturally similar models achieving strong performance on both. Leveraging this observation, we propose a unified framework that combines weight sharing across two transformer backbones, weight sharing within a single backbone, and pruning across both backbones. More specifically, we investigate weight sharing and pruning for two components of the transformers: (1) Multi-head Self-Attention (MSA) and (2) Feed-Forward Network (FFN) layers. To jointly perform weight sharing and pruning, we propose a regularization term that aligns model weights with the desired structure during the multimodal pre-training step. The structure vectors for sharing and pruning are generated by a hypernetwork, which can capture complex interactions between pruning and sharing across layers and modalities. We train the hypernetwork and model weights iteratively so that the learned structure evolves along with the model weights. After minimizing the proposed objective in the pre-training step, we perform weight sharing and pruning and fine-tune the compressed model on downstream tasks. Finally, we perform experiments on vision and language tasks, including Referring Expression Comprehension (REC), Visual Question Answering (VQA), and Object Detection, using the state-of-the-art grounding based models MDETR and GLIP. Our experiments show that we can compress these models by 35-40% by sharing and pruning MSA and FFN weights with almost no loss in accuracy.

1. INTRODUCTION

The dominant architecture in natural language processing (NLP) is the Transformer (Vaswani et al., 2017). Beyond NLP, recent advances in computer vision show that transformer-based models, such as ViT (Dosovitskiy et al., 2021) and DeiT (Touvron et al., 2020), can achieve similar or even better performance than convolutional neural networks (CNNs) on various tasks. As a result, architecturally similar models can be used on cross-modal tasks with vision and language data. This setting naturally provides the foundation for structurally sharing weights across modalities. The advantage of weight sharing is that it encourages weight reuse, reducing the number of parameters while largely maintaining model capacity. On the other hand, existing weight sharing techniques have limitations. Most of them (Lee et al., 2021; You et al., 2022; Lan et al., 2019; Reid et al., 2021) use manually designed sharing rules to share a whole layer or block, largely restricting the flexibility of weight sharing; this reduced flexibility can lead to drastic performance drops. To maximally utilize model parameters, we propose to unify cross-modal sharing, layer-wise sharing, and pruning in a single framework. Unlike previous works, the minimal unit of these operations is a weight vector instead of a whole layer or block, which drastically increases the flexibility of sharing and pruning. Also, instead of relying on manually designed strategies, the positions of sharing and pruning are learned in an end-to-end differentiable manner. To pursue a better trade-off between model performance and parameter efficiency, we aim to maximize flexibility by exploiting the structure of the transformer backbones. If only cross-modal sharing is considered, there is an upper bound on the compression rate (∼50%), reached when all layers of one backbone are shared with the other.
Another direction is to share layers within a single backbone, as in ALBERT (Lan et al., 2019); however, there is no prior success of cross-layer sharing for vision transformers. In addition to weight sharing, pruning is also an option; however, Li et al. (2020); Yu et al. (2022) show that it limits model capacity when the compression rate is high. We argue that, especially in multimodal tasks, it is hard to achieve a high compression rate while maintaining high accuracy using only pruning or only sharing. We therefore unify cross-modal sharing, layer-wise sharing, and pruning into a single framework to maximize the flexibility of sharing and pruning. With this increased flexibility, finding which weights to prune or share becomes very difficult. To solve this problem, we design a hypernetwork that learns the binary structure vectors for sharing and pruning efficiently; it can effectively capture the complex interactions between pruning and sharing across different layers and modalities. Since we cannot share and prune a weight vector at the same time, we apply constraints on different parts of the structure vectors. We use a regularization term to softly align the learned structures and backbone weights instead of directly applying the structure vectors; the benefit is that we do not need to append a separate fine-tuning process to the pre-training step. Finally, we train the hypernetwork and the model iteratively so that the learned structures can adapt to changes in the model during training. With these designs, our method learns suitable structures for sharing and pruning at a small extra cost. We evaluate the effectiveness of our method with two state-of-the-art vision and language grounding models: MDETR (Kamath et al., 2021) and GLIP (Li et al., 2022).
We use the CLIP text transformer (Radford et al., 2021) and DeiT (Touvron et al., 2020) as the text and vision backbones, so that the two backbones have similar architectures. Our method is evaluated on different downstream tasks: RefCOCO/RefCOCO+ (Yu et al., 2016)/RefCOCOg (Mao et al., 2016) for referring expression comprehension (REC), GQA (Hudson & Manning, 2019) for visual question answering (VQA), MS-COCO object detection (Lin et al., 2014), and Flickr-30k (Plummer et al., 2015) for phrase grounding. On these benchmarks, our method can remove around 40%-45% of the backbone parameters and 35%-40% of all parameters with almost no accuracy drop; on some tasks, it even outperforms the original model. These results show that the proposed framework achieves a strong trade-off between the number of parameters and accuracy.

2. RELATED WORKS

Vision and Language Models. Existing vision and language models can be broadly grouped into two categories: (1) two-stage and (2) single-stage. Two-stage methods (Yu et al., 2018; Chen et al., 2020; Lu et al., 2019) utilize off-the-shelf object detectors, e.g., Faster-RCNN, to detect objects and represent them with convolutional features. For referring expression comprehension, the referring expression is matched to its appropriate region (Yu et al., 2018). For other vision and language tasks, region descriptions are passed through two transformers to get image and language representations (Lu et al., 2019; Zhang et al., 2021a; Gan et al., 2020). Single-stage methods (Kamath et al., 2021; Deng et al., 2021; Yang et al., 2020; Chen et al., 2018; Li & Sigal, 2021) avoid using a detached off-the-shelf object detector and perform end-to-end training and inference, reducing the computational complexity of the two-stage methods. Another advantage of single-stage methods is that the full model can be pre-trained for text-conditioned object detection, as in MDETR (Kamath et al., 2021), GLIP (Li et al., 2022), GLIP-v2 (Zhang et al., 2022), etc.

Pruning with Transformers. With the increasing popularity of vision transformers, many works perform pruning on different components, such as pruning tokens (Kong et al., 2021; Xu et al., 2021; Rao et al., 2021), pruning structures or weights (Chen et al., 2021; Yin et al., 2023; Lou et al., 2022b), or all of the above (Yang et al., 2021). Pruning for language transformers (Li et al., 2020; Gale et al., 2019) can be traced back even earlier. Unlike these works, in the cross-modal setting, we only prune weights that are not useful in either modality.

Transformer-based Unified Backbones.
Transformers have recently enjoyed tremendous popularity in processing different modalities, including images, videos (Girdhar et al., 2019), speech (Dong et al., 2018), audio (Jaegle et al., 2021), and language (Jaegle et al., 2021; Hu & Singh, 2021; Zhang et al., 2021b). The universality of transformers has been exploited for multi-tasking: using the same vision backbone for different vision-only tasks (Girdhar et al., 2022), or the same vision and language backbone for different vision and language tasks (Hu & Singh, 2021). Likhosherstov et al. (2021), on the other hand, perform multitasking with the same transformer backbone on different modalities, and Wang et al. (2021) pre-train the model on vision-only, text-only, and vision-language tasks and fine-tune it on vision-only, text-only, or vision-language tasks. Similarly, Li et al. (2021); You et al. (2021) pre-train a unified transformer backbone with images and their captions and fine-tune the backbone for single-modality downstream tasks. Different from these studies, we propose structural cross-modal and layer-wise sharing and pruning across two transformer backbones for grounding based vision and language downstream tasks with almost no loss in accuracy.

Weight Sharing with Transformers. There have been a number of studies that focus on sharing weights in a single transformer on language-only tasks (Lan et al., 2019; Reid et al., 2021; Takase & Kiyono, 2021; Lou et al., 2022a). These studies mostly re-use a language transformer's full encoder (MSA+FFN) at different depths using some manually defined strategy. Parameter-efficient cross-modal transformers (Lee et al., 2021) reduce parameters with manually designed weight sharing strategies. Another recent study, Kim et al. (2022), performs weight sharing in the multimodal fusion network, which is orthogonal to our study, as we explore weight sharing across the transformer backbones.
Finally, a recent work, MS-CLIP (You et al., 2022) , shares weights across modalities by rigorously examining the architecture choices and adding early specialization layers. On the other hand, our method is general and flexible as we learn where to share and prune weight vectors in a vision and language model. Additionally, unlike MS-CLIP, we do not change the architecture of the original model. As a result, our method can be applied to any vision and language model with transformer backbones. Finally, different from these studies, we focus on grounding based vision and language models, which have shown impressive performance on many vision and language tasks (Kamath et al., 2021; Li et al., 2022; Zhang et al., 2022) .

3. LEARNING TO SHARE AND PRUNE STRUCTURES FOR VISION AND LANGUAGE TRANSFORMERS

3.1 OVERVIEW

In a typical vision and language model, we have two backbones: a vision backbone (v) and a text backbone (t). We use W^*_l, * ∈ {v, t}, to denote the weight matrix of the l-th layer from the vision (W^v_l) or the text (W^t_l) backbone, where l = 1, ..., L and L is the total number of layers. The same notation applies to the sharing and pruning vectors s^* and m^*. Since we consider architecturally similar backbones, the vision and text backbones have the same L. Transformers are typically used to process both text and vision data. A classic transformer has two core components: (1) Multi-head Self-Attention (MSA) and (2) Feed-Forward Network (FFN) layers. These are the targets of structural weight sharing and pruning, since they contain most of the weights in the backbones. To find which structures to prune or share, we first optimize the sharing and pruning vectors produced by the hypernetwork during the pre-training process of MDETR (Kamath et al., 2021) and GLIP (Li et al., 2022). To further increase the flexibility of pruning and sharing, layer-wise sharing (inspired by ALBERT (Lan et al., 2019) but at a finer granularity) within the text backbone is also performed. A detailed comparison of the different sharing and pruning options is given in Fig. 2. Since we learn discrete structure vectors, directly applying them in the fine-tuning step can result in severe accuracy drops. We instead softly align model weights and learned structures via a regularization term in the pre-training step. After pre-training, we conduct structural pruning and sharing based on the learned structure vectors. Finally, we fine-tune the compressed model on the downstream task. An overview of how our method shares and prunes weights is provided in Fig. 1.

3.2. SHARING AND PRUNING ACROSS TWO BACKBONES

For MSA layers, we have three weight matrices for the query W^*_q ∈ R^{d×d}, key W^*_k ∈ R^{d×d}, and value W^*_v ∈ R^{d×d}, where again * ∈ {v, t}. With input tokens X^* ∈ R^{N^*×d}, the Q^*, K^*, and V^* for self-attention are obtained by Q^* = X^* W^*_q, K^* = X^* W^*_k, V^* = X^* W^*_v, where d is the embedding dimension and N^* is the number of tokens for the given modality. To make minimal changes to the original model, we perform pruning only on W^*_q and W^*_k, while structural sharing is applied to all three weight matrices, both between backbones and across layers. For pruning, we use a binary vector m^* ∈ {0, 1}^d on W^*_q and W^*_k. Taking the vision backbone as an example, the product of the query and key in MSA becomes Q^{v_l} (K^{v_l})^T = X^{v_l} (W^{v_l}_q ⊙ m^{v_l})(W^{v_l}_k ⊙ m^{v_l})^T (X^{v_l})^T, where m^{v_l} is first resized to the dimensions of W^{v_l}_q, ⊙ is the element-wise product, and X^{v_l} is the feature map from the previous layer. We perform cross-modal sharing between the two backbones and cross-layer sharing within the text backbone. To achieve sharing, we use a binary vector s^* ∈ {0, 1}^d to share weights between the two modalities and across layers. Weights of the text backbone are reused for cross-modal sharing; as a result, a weight vector can be implicitly used in different layers across the text and vision backbones. Taking W^{v_l}_q as an example, cross-modal sharing gives W^{v_l}_q = s_l ⊙ W^{v_l}_q + (1 - s_l) ⊙ W^{t_l}_q, where s_l is also first expanded to the size of W^{v_l}_q. In short, weight vectors at indices i with s_{l,i} = 0 are shared. Without loss of generality, we omit the modality notation (v or t) for the cross-modal sharing vector s_l, since the sharing direction is fixed.
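To make the two operations concrete, the following is a minimal NumPy sketch (the paper's implementation is in PyTorch) that applies a pruning mask and a cross-modal sharing vector to a toy query matrix. All names, sizes, and the random binary vectors standing in for hypernetwork outputs are illustrative, not taken from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension (real backbones use e.g. d = 768)

# Hypothetical query weights of one layer from each backbone, shape (d, d)
W_q_v = rng.standard_normal((d, d))  # vision backbone
W_q_t = rng.standard_normal((d, d))  # text backbone

# Binary structure vectors; in the paper these come from the hypernetwork
m_v = rng.integers(0, 2, d).astype(float)  # pruning mask: m_i = 0 -> prune vector i
s = rng.integers(0, 2, d).astype(float)    # sharing vector: s_i = 0 -> share vector i

# Pruning: broadcasting m over rows zeroes out whole weight vectors (columns),
# i.e. W ⊙ m after "resizing" m to the weight's shape.
W_q_pruned = W_q_v * m_v[None, :]

# Cross-modal sharing: positions with s_i = 0 reuse the text backbone's weights.
W_q_mixed = s[None, :] * W_q_v + (1.0 - s)[None, :] * W_q_t
```

Because sharing and pruning act on whole weight vectors rather than entire layers, the structure vectors can mix modality-specific and shared columns within a single matrix.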
Similarly, cross-layer sharing has the following form: W^{t_l}_q = s^{t_l} ⊙ W^{t_l}_q + (1 - s^{t_l}) ⊙ W^{t_b}_q, where W^{t_l}_q denotes the weights of the l-th layer in the text backbone, W^{t_b}_q is the assigned base weights, and s^{t_l} is the vector used for cross-layer sharing. As a result, the final weights of the text backbone are divided into two parts: layer-specific weights and weights shared from base layers. A naive approach to cross-layer sharing uses the weights of a single layer as base weights for all other layers (Lan et al., 2019); however, this restricts the diversity of weights across layers. To enhance the flexibility of cross-layer sharing, we divide the text backbone into several blocks, each with its own base weights. Specifically, we establish a set of base layers (in sorted order) B = {b_1, b_2, ..., b_{|B|}}. All layers are naturally split into blocks based on the base layers; e.g., the first block contains the layers with b_1 ≤ l < b_2. In our experiments, the base weights are set to the first layer of each block. When l is a base layer (l ∈ B), we force s^{t_l} = 1, since we assign the base layer its own weights and potentially reuse them for the other layers in the same block. Note that after cross-layer sharing, the base weights for cross-modal sharing also change. Putting cross-layer sharing and cross-modal sharing together, we have:

W^{v_l}_q = s_l ⊙ W^{v_l}_q + (1 - s_l) s^{t_l} ⊙ W^{t_l}_q + (1 - s_l)(1 - s^{t_l}) ⊙ W^{t_b}_q. (5)

As a result, the final W^{v_l}_q consists of weights from W^{t_l}_q (the text backbone at the same layer), W^{t_b}_q (the text backbone at the base layer), and vision-specific weights. For FFN layers, we have a weight matrix W^*_E ∈ R^{d×d′} that expands the feature dimension and another weight matrix W^*_C ∈ R^{d′×d} that compresses it.
To keep the feature map size d unchanged, we conduct sharing and pruning along the dimension of size d′; the process is similar to that for MSA layers, and the detailed formulation is provided in the supplementary materials. We do not apply sharing and pruning to other layers, such as Layer Norm (Ba et al., 2016), since they contain very few parameters.
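The composition of cross-layer sharing followed by cross-modal sharing can be checked against the expanded form of Eq. 5 in a few lines. The sketch below uses toy weights and random binary vectors; the variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # toy dimension

# Hypothetical query weights: vision layer l, text layer l, and the text base layer b
W_v_l = rng.standard_normal((d, d))
W_t_l = rng.standard_normal((d, d))
W_t_b = rng.standard_normal((d, d))

s_l = rng.integers(0, 2, d).astype(float)   # cross-modal: s_l_i = 0 -> take from text
s_tl = rng.integers(0, 2, d).astype(float)  # cross-layer: s_tl_i = 0 -> reuse base layer b

# First apply cross-layer sharing inside the text backbone ...
W_t_l_eff = s_tl[None, :] * W_t_l + (1 - s_tl)[None, :] * W_t_b

# ... then cross-modal sharing on top; expanding this composition recovers Eq. 5:
# W = s_l ⊙ W_v_l + (1 - s_l) s_tl ⊙ W_t_l + (1 - s_l)(1 - s_tl) ⊙ W_t_b
W_v_l_eff = s_l[None, :] * W_v_l + (1 - s_l)[None, :] * W_t_l_eff
```

Since the two sharing steps compose multiplicatively, a vision weight vector ends up in exactly one of three sources: its own weights, the text layer's weights, or the text base layer's weights.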

3.3. CONFLICTS OF STRUCTURE VECTORS

Without any constraint, m and s could conflict: for example, if we prune a weight vector, it becomes meaningless to share it. To resolve such conflicts, we impose the following constraint for cross-modal sharing and pruning:

(m^{v_l}_i, s^l_i) ∈ C, (m^{t_l}_i, s^l_i) ∈ C, where C = {(x, y) | (x, y) ≠ (0, 0)}, (6)

where i indexes the sharing or pruning entry for a given weight vector. If (m_i, s_i) ∈ C, the structure vectors never share and prune the same weight vector, which resolves the conflict. The same constraint applies to cross-layer sharing and pruning in the text backbone, but in a more sophisticated form:

(m^{t_l}_i, s^{t_l}_i) ∈ C for l ∉ B, and (m^{t_l}_i, s^{t_l}_i s^{t_{l+1}}_i ··· s^{t_{b′}}_i) ∈ C for l ∈ B, (7)

where b′ denotes the last layer of the current block; e.g., if l = b_1, the current block consists of the layers b_1 ≤ l < b_2 and b′ = b_2 - 1. The product s^{t_l}_i s^{t_{l+1}}_i ··· s^{t_{b′}}_i indicates whether element i is shared by any layer in the current block, in which case it must be kept in the base layer. We do not add a constraint between s^l_i and s^{t_l}_i, since there is no conflict between the two kinds of sharing. To directly enforce the constraint, we can prioritize m_i or s_i and set the other accordingly. For example, if m^{t_l}_i = s^l_i = 0 and s_i is prioritized (shared weights are not pruned), we directly set m^{t_l}_i = 1 to comply with the constraint in Eq. 6. In practice, we prefer sharing to pruning, since sharing preserves model capacity to some extent.

3.4. LEARNING STRUCTURE VECTORS

To generate m and s, we use a hypernetwork (HN) parameterized by θ together with the Gumbel-Sigmoid technique (Jang et al., 2016): m, s = HN(z, θ), (8) where z is a predefined input vector to the HN. The HN is composed of GRUs (Chung et al., 2014) and multilayer perceptrons (MLPs): the GRU captures inter-layer interactions, and the MLPs handle intra-layer interactions. More details of the HN are presented in the supplementary materials. To learn the HN, we solve the following optimization problem:

min_θ L_pre-training(x, y; m, s) + λ R(P(m, s), pP_total), (9)

where x is an input image-text pair, y is its label, and L_pre-training is the original pre-training loss of MDETR (Kamath et al., 2021) or GLIP (Li et al., 2022), whose model structure is determined by m and s. λ controls the strength of the regularization term R(P(m, s), pP_total), which controls how many parameters we keep, as specified by p ∈ (0, 1]. P(m, s) is the number of remaining parameters determined by m and s, and P_total is the total number of parameters in the MSA+FFN layers of the backbones. In practice, we let R(P(m, s), pP_total) = log(max(P(m, s), pP_total)/pP_total) (Gao et al., 2021). R could instead be any regression loss, such as MSE or MAE; however, those can be harder to push small enough, especially when the number of parameters is large. During the optimization of the HN, we keep the weights of the whole model, including the backbones and the fusion part, frozen. Since the cost of pre-training is quite large, we use a small portion of the whole dataset to train the HN. We then train the model weights on the whole pre-training dataset while controlling the architecture with m and s, iterating between training the hypernetwork on the subset and training the model.
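The parameter regularizer is a direct reading of R(P(m, s), pP_total) = log(max(P, pP_total)/pP_total). The sketch below (the function name is ours) shows why it is preferable to a generic regression loss: it is exactly zero once the budget is met, so it stops competing with the pre-training loss:

```python
import math

def param_regularizer(P_remaining, p, P_total):
    """Sketch of R(P(m, s), p * P_total) = log(max(P, p * P_total) / (p * P_total)).

    The value is zero once the remaining parameter count reaches the budget
    p * P_total, and grows only logarithmically above it, so the penalty
    vanishes exactly when the target compression rate is met.
    """
    budget = p * P_total
    return math.log(max(P_remaining, budget) / budget)
```

For example, with p = 0.5 and P_total = 100, an uncompressed model (P = 100) incurs log(2) ≈ 0.69, while any P ≤ 50 incurs exactly 0.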

3.5. LEARNING TO JOINTLY SHARE AND PRUNE WEIGHTS

To share and prune the weights, we could directly apply m and s (from the hypernetwork) to the vision and text transformers. However, this can hurt accuracy, since the outputs of the vision and text backbones change drastically. To alleviate this problem, we use a selection-based regularization term that softly pushes selected weights closer (for sharing) or toward zero (for pruning):

R_w(W, m, s) = ∥(1 - s) ⊙ W^t - (1 - s) ⊙ W^v∥_1 + ∥(1 - s^{t_l}) ⊙ W^{t_l} - (1 - s^{t_l}) ⊙ W^{t_b}∥_1 + Σ_i (1 - m_i) ∥W_{[:,i]}∥_2, (10)

where the first two terms push the selected weight vectors closer and the final term uses Group Lasso to push the selected (to-be-pruned) weight vectors to 0. This regularization gradually aligns the model weights with the learned structure vectors, creating a smooth process for reducing the number of model parameters. Given this regularization, we learn the model weights W by optimizing the following objective:

min_W L_pre-training(x, y; W) + γ R_w(W, m, s), (11)

where γ controls the strength of R_w(W, m, s). After the pre-training process, the corresponding weights are pruned and shared to compress the model, which can then be fine-tuned on downstream tasks. We present the training process of the proposed method in Algorithm 1. Note that the forward calculation differs when optimizing the model (Obj. 11) and the HN (Obj. 9): when optimizing the HN, m and s are applied in the forward calculation, whereas when optimizing the model, they enter only through the regularization term. The goal of using the HN is to accelerate the learning of m and s and to better capture the complicated interactions between pruning and sharing across modalities and layers. We note that m and s could instead be set as directly learnable parameters; however, this slows down the learning process and thus lowers the final performance.
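As a rough illustration, R_w for a single layer can be computed as below. This is a simplified sketch with illustrative names; following the convention that s_i = 0 marks shared vectors and m_i = 0 marks pruned ones, the selectors (1 - s), (1 - s_t), and (1 - m) pick out exactly the weight vectors the structure vectors act on:

```python
import numpy as np

def alignment_reg(W_v, W_t, W_t_b, s, s_t, m):
    """Single-layer sketch of R_w. Conventions: s_i = 0 -> cross-modal shared,
    s_t_i = 0 -> cross-layer shared, m_i = 0 -> pruned.
    """
    cross_modal = np.abs((1 - s)[None, :] * (W_t - W_v)).sum()      # L1: pull shared pairs together
    cross_layer = np.abs((1 - s_t)[None, :] * (W_t - W_t_b)).sum()  # L1: pull layer toward its base
    group_lasso = ((1 - m) * np.linalg.norm(W_v, axis=0)).sum()     # push pruned columns to zero
    return cross_modal + cross_layer + group_lasso
```

The penalty is zero exactly when every selected pair already coincides and every selected column is already zero, so applying the hard structure vectors after pre-training causes little change to the backbone outputs.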

4. EXPERIMENTS

4.1 SETTINGS

GLIP and MDETR. We apply our method to two recently published grounding based vision and language models: (1) MDETR (Kamath et al., 2021) and (2) GLIP (Li et al., 2022). MDETR performs end-to-end pre-training on a dataset of 1.3M image-text pairs for the task of text-conditioned object detection. GLIP scales the pre-training dataset for text-conditioned object detection up to 27M image-text pairs by training a teacher network on 1.3M image-text pairs with ground-truth bounding boxes and using the teacher on the 27M image-text pairs to obtain pseudo bounding boxes. The pre-trained GLIP and MDETR are then used for vision and language tasks.

Baselines. We constructed several alternative settings to verify the design choices of our method. The first setting is Fully cross-modal Sharing (FS): we share all MSA+FFN layers across the two backbones during both the pre-training and fine-tuning stages, without any regularization term. FS can be considered a manually designed weight sharing method similar to You et al. (2022); Lee et al. (2021). The second setting is cross-modal Sharing and Pruning (SP), which applies only cross-modal sharing and pruning to the vision and text backbones. The third setting is cross-modal, Layer-Wise Sharing and Pruning (LWSP); on top of SP, LWSP also shares weights across layers in the text backbone. The fourth setting is cross-modal, Block-Wise Sharing and Pruning (BWSP), where the text backbone is split into three blocks and each block contains 4 consecutive MSA and MLP layers. Fig. 2 illustrates these settings. By default, layer-wise and block-wise sharing are applied to the text backbone, and the weights of the text backbone are reused for the vision backbone in cross-modal sharing, since Lu et al. (2021) showed that language transformers can generalize to vision tasks.

Implementation Details.
In all experiments, we replace the original backbones used in MDETR and GLIP with the CLIP text transformer and DeiT to obtain architecturally similar backbones. Due to resource limitations, we change the GLIP pre-training dataset to be the same as MDETR's; the pre-training dataset for MDETR is unchanged. We note that MDETR and GLIP train their models on images of different sizes and test on images whose smaller side is resized to 800 pixels. In contrast, we resize images to 224x224 pixels at both training and test time, as the motivation of our study is to maintain the accuracy of the full model after compression. Due to space limitations, other implementation details are given in the supplementary materials.

4.2. RESULTS

MDETR results. For MDETR, we construct two models, MDETR-Small and MDETR-Large, with different model settings. We report results on the validation sets of RefCOCO/RefCOCO+/RefCOCOg and GQA. The results for MDETR-Small are shown in Tab. 1. From the table, we can see that the performance of BWSP is on par with the baseline model (relative changes: +0.1/+0.0/-0.5/+0.1 on the four tasks). At the same time, BWSP removes 24M parameters (around 38% of the two backbones) compared to the baseline model. LWSP and SP perform worse than BWSP, indicating that increased flexibility helps when reducing the number of parameters. The results for MDETR-Large are shown in Tab. 2. On MDETR-Large, BWSP also performs similarly to the baseline model, but the parameter reduction rate is larger (40% of the baseline model). For both MDETR-Large and MDETR-Small, the manually designed sharing rule FS causes significant performance drops. In addition, given a similar number of parameters, BWSP outperforms other full models with different backbones in both the large and small model setups. Finally, we remove R_w during pre-training, directly training the hypernetwork and fine-tuning the model; we call this setting BWSP w/o R_w. As expected, BWSP w/o R_w results in a much larger performance drop, which justifies the design choice of using R_w to align model weights and structures during pre-training.

GLIP results. We evaluate our method with GLIP on MS-COCO object detection (COCO) and Flickr-30k phrase grounding. The results on COCO are shown in Tab. 3: BWSP still outperforms SP and LWSP while also removing more parameters. In Tab. 4, we make a similar observation. In summary, BWSP provides more flexibility than SP and LWSP, which results in a stronger performance/parameter trade-off.

4.3. DETAILED ANALYSIS

Effects of R_w. In Fig. 4, we plot the weight difference with and without R_w. Taking cross-modal sharing as an example, for the l-th layer we plot (1/n) Σ_{i=1}^{n} ∥(1 - s_i) ⊙ W^{t_l} - (1 - s_i) ⊙ W^{v_l}∥_1 for regularized weights, and (1/n) Σ_{i=1}^{n} ∥s_i ⊙ W^{t_l} - s_i ⊙ W^{v_l}∥_1 for weights without regularization. The figure shows that the weight difference is much smaller with regularization, for both pruning and cross-modal sharing. For layer-wise sharing, the difference is smaller since the ratio of layer-shared weights is larger. In conclusion, Fig. 4 shows that the proposed regularization R_w effectively encourages sharing and pruning of the desired weights while leaving the learning of the other weights almost intact.

Analysis of Pruned and Shared Weights. In Fig. 3, we plot the details of the pruned and shared weights of BWSP for MDETR-Small. We observe that the layer-wise sharing rate is large for both FFN and MSA layers of the text backbone. For the vision backbone, the earlier layers are almost fully preserved for both FFN and MSA layers. We also observe that FFN layers of the vision backbone are shared more aggressively, whereas MSA layers are shared only at later layers. The pruning ratio is not pronounced except for the vision FFN layers. In summary, the text backbone contributes more to the reduction of weights, which is reasonable since, for the tasks targeted by MDETR and GLIP, the vision side is generally more complex than the language side.

5. CONCLUSION

In this paper, we investigate how to perform structural sharing and pruning for grounding based vision and language models. Specifically, we perform cross-modal sharing and pruning across the vision and text backbones and layer-wise sharing within the text backbone. With this increased flexibility, it becomes harder to select the right set of weight vectors to share and prune. To overcome this challenge, we propose a hypernetwork that captures the interactions of pruning and sharing across layers and modalities. Next, in the pre-training stage, we gradually align the backbones' weights with the desired structure learned by the hypernetwork. Finally, the model is fine-tuned on downstream tasks. Our experiments on MDETR and GLIP show that we can remove 35%-40% of model weights without a significant loss in accuracy.

A SUPPLEMENTARY MATERIALS

A.1 PRUNING AND SHARING FOR FFN LAYERS

We provide more details on the pruning and sharing of FFN layers in this section. Recall that FFN layers have a weight matrix W^*_E ∈ R^{d×d′} that expands the feature dimension and a weight matrix W^*_C ∈ R^{d′×d} that compresses it. To keep the model dimension d unchanged, we conduct sharing and pruning along the dimension of size d′. For cross-modal sharing, we then have W^{v_l}_E = s_l ⊙ W^{v_l}_E + (1 - s_l) ⊙ W^{t_l}_E and (W^{v_l}_C)^T = s_l ⊙ (W^{v_l}_C)^T + (1 - s_l) ⊙ (W^{t_l}_C)^T, where, as in MSA layers, s_l is first resized to the size of W^{v_l}_E. For layer-wise sharing, we have W^{t_l}_E = s^{t_l} ⊙ W^{t_l}_E + (1 - s^{t_l}) ⊙ W^{t_b}_E and (W^{t_l}_C)^T = s^{t_l} ⊙ (W^{t_l}_C)^T + (1 - s^{t_l}) ⊙ (W^{t_b}_C)^T, where s^{t_l} is resized to the size of W^{t_l}_E. Taking the vision backbone as an example, the pruning of W^{v_l}_E and W^{v_l}_C is done by applying the mask m^{v_l} to the intermediate feature map between W^{v_l}_E and W^{v_l}_C: X̃^{v_l} = m^{v_l} ⊙ X^{v_l}. To simplify notation, we reuse s_l, s^{t_l}, and m^{v_l} for FFN layers. As in MSA layers, sharing and pruning for FFN layers also satisfy the constraints defined in Eq. 6 and Eq. 7.
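The reason masking the intermediate feature map keeps d unchanged is that it is equivalent to structurally pruning the corresponding columns of W_E and rows of W_C. This equivalence can be verified with a small NumPy sketch (toy sizes and illustrative names):

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_hidden, N = 8, 16, 3  # model dim d, FFN hidden dim d', number of tokens (toy sizes)

W_E = rng.standard_normal((d, d_hidden))  # expands features: (N, d) -> (N, d')
W_C = rng.standard_normal((d_hidden, d))  # compresses back:  (N, d') -> (N, d)
X = rng.standard_normal((N, d))
m = rng.integers(0, 2, d_hidden).astype(float)  # binary pruning mask over d'

# Masking the intermediate feature map ...
out_masked = ((X @ W_E) * m[None, :]) @ W_C

# ... equals pruning columns of W_E and the matching rows of W_C
# (for binary m, since m * m = m), leaving the model dimension d untouched.
out_pruned = (X @ (W_E * m[None, :])) @ (W_C * m[:, None])
```

After pre-training, the zeroed columns and rows can therefore be physically removed, shrinking d′ without affecting any other layer.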

A.2 DETAILED STRUCTURE OF HYPERNETWORK

The detailed architecture is summarized in Tab. S1. The input of the hypernetwork is z ∈ R^{2L×32}, sampled from a uniform distribution and kept unchanged after training starts.

Table S1: The architecture of the hypernetwork.
Input z → GRU(32, 64) → LayerNorm → GeLU
MLP_l(64, d^l_s) → outputs o^l_s, l = 1, ..., L
MLP_l(64, d^{v_l}_m) → outputs o^{v_l}_m, l = 1, ..., L
MLP_l(64, d^{t_l}_s) → outputs o^{t_l}_s, l = 1, ..., L
MLP_l(64, d^{t_l}_m) → outputs o^{t_l}_m, l = 1, ..., L

The outputs of the GRU are fed in parallel into the MLP layers to produce the pre-activations of the pruning and sharing vectors. The final s_l, m^{v_l}, s^{t_l}, m^{t_l} are obtained with the Gumbel-Sigmoid technique (Jang et al., 2016): s_l = round(sigmoid((o^l_s + g + b)/τ)), m^{v_l} = round(sigmoid((o^{v_l}_m + g + b)/τ)), s^{t_l} = round(sigmoid((o^{t_l}_s + g + b)/τ)), m^{t_l} = round(sigmoid((o^{t_l}_m + g + b)/τ)), where sigmoid(·) is the sigmoid function, round(·) rounds its input to the nearest integer, g follows the Gumbel distribution, g ~ Gumbel(0, 1), b is a constant, and τ is a temperature hyper-parameter. We set b = 3.0 so that model structures are kept intact at the beginning, and τ = 0.4 for all experiments. Since round(·) is not differentiable, we use the straight-through gradient estimator (Bengio et al., 2013) to circumvent this problem. L is the total number of MSA and FFN layers, ordered as they appear in the transformer. The hyper-parameters λ and γ for R and R_w are set to 6.0 and 0.0005 for all datasets and tasks. We train the hypernetwork with ADAM, a constant learning rate of 0.001, and weight decay of 0.0001. For D_sub, we randomly sample 10% of the samples from D. We found that this hyper-parameter setting performs well for both MDETR and GLIP. Other pre-training and fine-tuning settings are identical to the original ones.
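For illustration, a NumPy sketch of the Gumbel-Sigmoid forward pass might look like the following (the straight-through backward pass requires autograd and is omitted; the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(3)

def gumbel_sigmoid(logits, tau=0.4, b=3.0):
    """Forward pass of the Gumbel-Sigmoid gate producing a hard binary vector.

    During training, the non-differentiable round() is bypassed with the
    straight-through estimator; this sketch shows only the forward sampling.
    """
    g = rng.gumbel(0.0, 1.0, size=logits.shape)            # g ~ Gumbel(0, 1)
    probs = 1.0 / (1.0 + np.exp(-(logits + g + b) / tau))  # sigmoid((o + g + b) / tau)
    return np.round(probs)
```

With b = 3.0, zero-initialized logits map to probabilities close to 1, so nearly all gates start open and the model structure is intact at the beginning of training.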
In our method, p controls how much of the compressible backbone parameters should be removed. We list the detailed choices of p in Tab. S2; p is the only hyper-parameter that needs to be modified to change how much weight is compressed. The results of all MDETR tasks are obtained by fine-tuning on downstream tasks. Results of GLIP on the COCO detection task (Tab. 3) are also obtained by fine-tuning. We follow the fine-tuning settings in the original papers and codebases. Taking GLIP pre-training as an example, we use the hyper-parameters given in https://github.com/microsoft/GLIP/blob/main/configs/pretrain/glip_Swin_T_O365.yaml; we replaced BACKBONE, LANGUAGE_BACKBONE, and DATASETS in the configuration file and inserted the hyper-parameters for training the hypernetwork. For the Flickr-30k results in Tab. 4, we add a very short fine-tuning period (∼3 epochs) to obtain them (including the baseline). For more information on the pre-training datasets, we refer readers to the MDETR (Kamath et al., 2021) and GLIP (Li et al., 2022) papers. In addition, since the implementation of Eq. 6 and Eq. 7 may not be straightforward at first glance, we provide PyTorch-style pseudo code in this section as an example for Eq. 6; Eq. 7 can be implemented similarly.

Table S3: COCO object detection results. We use GLIP with CLIP+DeiT-Small as the backbone. The results are lower than those in Tab. 3 since the pre-training dataset is smaller due to pre-training costs. Rows in blue are our default settings.

To see how different choices of hyper-parameters affect our method, we adjust λ, γ, |D_sub| (the number of samples in D_sub), and |B| (the number of blocks). The overall pre-training cost is quite high for GLIP and MDETR, so we only use Flickr-30k (Plummer et al., 2015) as the pre-training dataset. The experiments are conducted on GLIP with COCO object detection as the evaluation task. All other settings are the same as in Tab. 3.
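Eq. 6 itself is not reproduced in this appendix; under one plausible reading, in which a position selected for sharing (s = 0) must not simultaneously be pruned (m = 0) and sharing is prioritized, a minimal conflict-resolution sketch could look like:

```python
import numpy as np

def resolve_conflicts(s, m):
    # Sketch of one reading of the Eq. 6 constraint: with s = 0 marking a
    # shared position and m = 0 a pruned one, a shared position must not
    # also be pruned; sharing is prioritized, so its mask entry is reset
    # to 1 (kept) wherever the two structure vectors disagree.
    return s, np.where(s == 0, 1.0, m)

s = np.array([1.0, 0.0, 1.0, 0.0])   # positions 1 and 3 are shared
m = np.array([0.0, 0.0, 1.0, 1.0])   # positions 0 and 1 are marked pruned
s, m = resolve_conflicts(s, m)
print(m)   # [0. 1. 1. 1.]: the shared position 1 is no longer pruned
```

The exact tensor shapes and broadcasting in the real implementation depend on how s and m are resized per layer; this only illustrates the precedence rule.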
γ controls the regularization strength of R_w in Obj. 11. From Tab. S3, we can see that a smaller γ negatively affects the final results, probably because weights from different layers and modalities are not well aligned, which creates difficulty for fine-tuning. This observation is consistent with the results of 'BWSP w/o R_w' (γ = 0) in Tab. 1 and Tab. 2, which all indicate that poor alignment is harmful to the final performance. λ controls the regularization strength of the parameter constraint R(P(m, s), pP_total) in Obj. 9. From Tab. S3, we can see that an overly large λ (λ = 10) is harmful to our method: it causes the optimization to neglect the pre-training loss L_pre-training and focus only on reducing the number of parameters, which hinders the hypernetwork from finding the ideal sharing and pruning structures. |D_sub| = 0.10|D| strikes a good trade-off between pre-training costs and final performance. |B| controls the number of blocks for cross-layer sharing. We tested three settings, with results shown in Tab. S3. When |B| = 2, the final model has fewer unique base weights, which restricts weight diversity. When |B| = 6, the potential compression rate is limited. With |B| = 3, we obtain a good trade-off between weight diversity and the potential compression rate for the text transformer.



https://github.com/ashkamath/mdetr and https://github.com/microsoft/GLIP



Figure 1: An overview of weight sharing and pruning across two backbones. For simplicity, we use MSA layers as an example; the process for FFN layers is similar. The hypernetwork (HN) generates m, s to guide the pruning and sharing across modalities and within the layers of a block. As shown in the figure, to incorporate block-wise sharing, the text backbone is separated into several blocks. Conflicts between sharing and pruning are resolved before they are applied to the backbones' weights. Note that, unlike the other weights, W_V is not pruned.

Figure 3: Ratios of weight sharing and pruning across the MSA and FFN layers in two backbones. Dashed lines in the text backbone represent block separations.

Figure 4: Weights difference with or without the regularization R w . P represents Pruning, CS represents Cross-modal Sharing, and LS represents Layer-wise Sharing.

|D_sub| controls the number of samples in D_sub. From Tab. S3, we can see that increasing |D_sub| benefits the final performance, but it also increases pre-training costs. If |D_sub| is too small, the results degrade noticeably, as shown in the table.

Overview of the different settings of sharing and pruning. By combining cross-modal sharing, block-wise sharing, and pruning, we can reduce the number of parameters in different ways.

Algorithm 1: Learning to Jointly Share and Prune Weights for Grounding Based Vision and Language Tasks

Input: the pre-training dataset and a sub-dataset for learning structure vectors: D, D_sub; remained rate of parameters: p; hyper-parameters: λ, γ; pre-training epochs: E; the model for pre-training: f; the hypernetwork HN parameterized by θ

for e := 1 to E do
    /* Optimizing model weights. Freeze θ in HN. */
    for a mini-batch (x, y) in D do
        1. Generate m, s from HN with Eq. 8.
        2. Apply the constraints on m, s defined in Eq. 6 and Eq. 7; sharing is prioritized.
        3. Calculate R_w(W, m, s) given W and m, s.
        4. Calculate gradients w.r.t. W by minimizing Obj. 11 and update W.
    end
    /* Optimizing HN weights. Freeze W in the model. */
    for a mini-batch (x, y) in D_sub do
        1. Generate m, s from HN with Eq. 8.
        2. Apply the constraints on m, s defined in Eq. 6 and Eq. 7; sharing is prioritized.
        3. Calculate the parameter regularization term R(P(m, s), pP_total).
        4. Calculate gradients w.r.t. θ by minimizing Obj. 9 and update θ.
    end
end
Get f′ by pruning and sharing f based on m, s.
Return f′ for task-specific fine-tuning.
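The alternating structure of Algorithm 1 can be sketched with scalar stand-ins; the quadratic losses below are toy substitutes for L_pre-training, R_w, and R (not the paper's objectives), and serve only to show the two frozen/updated phases per epoch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Scalar stand-ins (assumptions): w plays the role of the model weights W,
# theta the hypernetwork parameters, and the quadratic terms replace
# L_pre-training, R_w, and R(P(m, s), p * P_total).
w, theta = rng.normal(), rng.normal()
lr_w, lr_theta, gamma, lam, p = 0.1, 0.1, 5e-4, 6.0, 0.6

for epoch in range(100):
    s = 1.0 / (1.0 + np.exp(-theta))    # soft structure vector from "HN"
    # Phase 1: update model weights with theta frozen (Obj. 11 stand-in:
    # (w - 1)^2 + gamma * s * w^2).
    grad_w = 2.0 * (w - 1.0) + gamma * 2.0 * s * w
    w -= lr_w * grad_w
    # Phase 2: update hypernetwork with w frozen, pushing the kept-parameter
    # ratio s toward the target p (Obj. 9 stand-in: lam * (s - p)^2).
    grad_theta = lam * 2.0 * (s - p) * s * (1.0 - s)
    theta -= lr_theta * grad_theta

print(1.0 / (1.0 + np.exp(-theta)))   # kept ratio, close to p = 0.6
```

The point of the alternation is that the learned structure (here s) and the model weights co-evolve: each phase optimizes its own objective while treating the other set of parameters as fixed.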

Results on tasks from MDETR. We use MDETR-Small with CLIP+DeiT-Small backbones. #PT/#PB in this table, Tab. 2, Tab. 3, and Tab. 4 denote the number of total parameters and the number of backbone parameters, respectively.

Results on tasks from MDETR. We use MDETR-Large with CLIP+DeiT-Base backbones.

COCO object detection results. We use GLIP with CLIP+DeiT-Small as the backbone.

Flickr30k results. We use GLIP with CLIP+DeiT-Small backbones.

Recall that the model embedding dimension is d and d′ is the dimension after expansion in FFN layers. The output dimension of each MLP is determined by whether it serves an MSA or an FFN layer. For sharing operations on MSA layers, d^l_s and d^{t_l}_s are equal to 3d. For pruning operations on MSA layers, d^{v_l}_m and d^{t_l}_m are equal to d, and the resulting masks are shared between query and key weights. For FFN layers, d^l_s, d^{v_l}_m, d^{t_l}_s, and d^{t_l}_m are all equal to d′.

Table S2: Choice of p for different models and settings.
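The mapping from layer type and operation to MLP head width can be summarized in a small helper (the function name is ours; the dimensions follow the description above):

```python
def head_dim(layer_type, op, d, d_prime):
    # Hypothetical helper mapping each hypernetwork MLP head to its output
    # width: MSA sharing heads emit 3d (query/key/value), MSA pruning heads
    # emit d (the mask is shared by query and key weights), and all FFN
    # heads emit the expanded dimension d'.
    if layer_type == "ffn":
        return d_prime
    return 3 * d if op == "share" else d

d, d_prime = 768, 3072   # e.g. a base-size transformer
print(head_dim("msa", "share", d, d_prime))   # 2304
print(head_dim("msa", "prune", d, d_prime))   # 768
print(head_dim("ffn", "prune", d, d_prime))   # 3072
```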

