LEARNING TO JOINTLY SHARE AND PRUNE WEIGHTS FOR GROUNDING BASED VISION AND LANGUAGE MODELS

Abstract

Transformers have been successful at processing different data modalities, such as language and image data, allowing architecturally similar models to achieve strong performance across modalities. Leveraging this observation, we propose a unified framework that combines weight sharing across two transformer backbones, weight sharing within each backbone, and pruning across both backbones. More specifically, we investigate weight sharing and pruning for two components of the transformer: (1) Multi-Head Self-Attention (MSA) and (2) Feed-Forward Network (FFN) layers. To jointly perform weight sharing and pruning, we use a regularization term that aligns the model weights with the desired structure during the multimodal pre-training step. The structure vectors for sharing and pruning are generated by a hypernetwork, which can capture complex interactions between pruning and sharing across layers and modalities. We train the hypernetwork and the model weights iteratively so that the learned structure evolves along with the model weights. After minimizing the proposed objective in the pre-training step, we perform weight sharing and pruning and fine-tune the compressed model on downstream tasks. Finally, we conduct experiments on vision and language tasks, including Referring Expression Comprehension (REC), Visual Question Answering (VQA), and Object Detection, using the state-of-the-art grounding-based models MDETR and GLIP. Our experiments show that these models can be compressed by 35-40% by sharing and pruning MSA and FFN weights with almost no loss in accuracy.

1. INTRODUCTION

The dominant architecture in natural language processing (NLP) is the Transformer (Vaswani et al., 2017). Beyond NLP, recent advances in computer vision show that transformer-based models, such as ViT (Dosovitskiy et al., 2021) and DeiT (Touvron et al., 2020), can achieve similar or even better performance than convolutional neural networks (CNNs) on various tasks. As a result, we can use architecturally similar models for cross-modal tasks involving vision and language data. This setting naturally provides a foundation for structurally sharing weights across different modalities. The advantage of weight sharing is that it encourages weight reuse, reducing the number of parameters while largely preserving model capacity. On the other hand, existing weight sharing techniques have notable limitations. Most of them (Lee et al., 2021; You et al., 2022; Lan et al., 2019; Reid et al., 2021) use manually designed sharing rules to share a whole layer or block, which severely restricts the flexibility of weight sharing and can lead to drastic performance drops. To utilize model parameters maximally, we propose to unify cross-modal sharing, layerwise sharing, and pruning in a single framework. Unlike previous works, the minimal unit of these operations is a weight vector rather than a whole layer or block, which drastically increases the flexibility of sharing and pruning. Moreover, to avoid manually designed strategies, the positions of sharing and pruning are learned in an end-to-end differentiable manner.
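To make the idea of vector-level sharing and pruning concrete, the following is a minimal NumPy sketch, not the paper's actual method. It assumes a toy per-row structure vector `z` (in the paper, such structure vectors come from a learned hypernetwork) that marks each weight row of two modality backbones as pruned, modality-specific, or shared, and computes an alignment regularizer of the kind described in the abstract; the function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy hidden size

# Toy FFN weight matrices for two modality backbones (vision / language).
W_vision = rng.normal(size=(d, d))
W_lang = rng.normal(size=(d, d))

# Hypothetical per-row structure vector z in {0, 1, 2}:
#   0 = prune the row, 1 = keep it modality-specific,
#   2 = share the row across the two backbones.
# In the paper this structure is produced by a learned hypernetwork;
# here it is simply fixed for illustration.
z = np.array([0, 1, 2, 2])

def apply_structure(W_a, W_b, z):
    """Return the structured (pruned/shared) versions of both matrices."""
    W_a2, W_b2 = W_a.copy(), W_b.copy()
    for i, zi in enumerate(z):
        if zi == 0:                       # prune: zero the row in both backbones
            W_a2[i] = 0.0
            W_b2[i] = 0.0
        elif zi == 2:                     # share: tie both rows to their average
            shared = 0.5 * (W_a[i] + W_b[i])
            W_a2[i] = shared
            W_b2[i] = shared
    return W_a2, W_b2

Wv_s, Wl_s = apply_structure(W_vision, W_lang, z)

# Alignment regularizer: penalize the distance between the current weights
# and their structured counterparts; a term like this is added to the
# pre-training loss so the weights gradually conform to the structure.
reg = np.sum((W_vision - Wv_s) ** 2) + np.sum((W_lang - Wl_s) ** 2)
```

Because the unit of sharing and pruning is a single weight row rather than a whole layer, the structure vector can mix all three decisions within one layer, which is the flexibility the framework aims for.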

