TOKEN MERGING: YOUR VIT BUT FASTER

Abstract

We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2× the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2× the throughput of ViT-L on video, with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, in practice improving training speed by up to 2× for MAE fine-tuning on video. Training with ToMe further minimizes the accuracy drop, leading to 2× the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with the state of the art on images, video, and audio.

1. INTRODUCTION

The introduction of transformers (Vaswani et al., 2017) from NLP to vision with Vision Transformers (ViTs) by Dosovitskiy et al. (2020) has rapidly advanced the field of computer vision. However, unlike in NLP, vision has since been dominated by domain-specific transformer hybrids like Swin (Liu et al., 2021; Dong et al., 2022) using vision-specific attention, MViT (Fan et al., 2021; Li et al., 2022) using vision-specific pooling, or LeViT (Graham et al., 2021) using vision-specific conv modules. The reason for this trend is simple: efficiency. Adding vision-specific inductive biases enables transformer hybrids to perform better with less compute. Yet, vanilla ViTs still have many desirable qualities: they consist of simple matrix multiplications, making them faster than their raw flop count would suggest; they support powerful self-supervised pre-training techniques such as MAE (He et al., 2022) that can produce state-of-the-art results while being fast to train; given their lack of assumptions about the data, they can be applied with little or no change across many modalities (Feichtenhofer et al., 2022; Huang et al., 2022); and they scale well with massive amounts of data (Zhai et al., 2021; Singh et al., 2022), recently obtaining up to 90.94% top-1 on ImageNet (Wortsman et al., 2022). However, running these massive models can be troublesome, and reproducing these results with a faster architecture would be difficult. A promising subfield of ViTs has recently emerged where, due to the input-agnostic nature of transformers, tokens can be pruned at runtime to enable a faster model (Rao et al., 2021; Yin et al., 2022; Meng et al., 2022; Liang et al., 2022; Kong et al., 2022).
Yet, token pruning has several disadvantages: the information loss from pruning limits how many tokens can reasonably be reduced; current methods require re-training the model to be effective (some with extra parameters); most cannot be applied to speed up training; and several prune different numbers of tokens depending on the input content, making batched inference infeasible. In this work, we present Token Merging (ToMe) to combine tokens, rather than prune them. Because of our custom matching algorithm, our method is as fast as pruning while being more accurate. Moreover, our method works with or without training, which unlocks its use on huge models with minimal accuracy drop. Using ToMe during training, we observe actual increases in training speed, in some cases cutting the total training time in half. Finally, we apply ToMe without any modifications to images, video, and audio and find it to be competitive with the state of the art in all cases. Our contributions are as follows: we introduce a technique to increase the throughput and real-world training speed of ViT models, both with and without training (Sec. 3), and thoroughly ablate our design choices (Sec. 4.1); we perform extensive experiments on images with several ViT models (Sec. 4.2) and compare to the state of the art in architecture design and token pruning methods (Sec. 4.3); we then repeat these experiments for both video (Sec. 5) and audio (Sec. 6) and find ToMe works well across modalities; and we visualize our results, finding that ToMe merges parts of objects on images (Fig. 4) and objects over their entire range of motion on video (Fig. 6). We hope ToMe can enable the creation of more powerful, faster ViT models.

2. RELATED WORK

Efficient Transformers. Several works have attempted to create more efficient transformers in both NLP and Vision. Some focus on faster attention (Choromanski et al., 2020; Shen et al., 2021; Dao et al., 2022; Wang et al., 2020; Bolya et al., 2022), some attempt to prune heads or features (Meng et al., 2022; Voita et al., 2019; Michel et al., 2019), and some attempt to infuse domain-specific modules (Mehta & Rastegari, 2021; Graham et al., 2021; Liu et al., 2021; 2022a; Dong et al., 2022). In this paper, we focus on speeding up existing ViT models by merging tokens to match the speed-accuracy trade-off of more complicated domain-specific models, sometimes without training.

Token Reduction. Since transformers can operate with any number of tokens, several recent works have attempted to prune the tokens from transformers in both NLP (Goyal et al., 2020; Kim & Cho, 2020; Kim et al., 2021; Lassance et al., 2021) and Vision (Meng et al., 2022; Yin et al., 2022; Kong et al., 2022; Song et al., 2022; Rao et al., 2021; Fayyaz et al., 2022; Yu & Wu, 2021). However, these methods require training, while our method can be used without training. Moreover, most pruning works are dynamic, i.e., the number of tokens varies between images or sentences. While this benefits accuracy, it limits practicality, as samples with different numbers of tokens can no longer be batched. To solve this, most pruning papers apply a mask during training rather than remove tokens, which negates the speed-up from pruning. Our method, on the other hand, can be applied during both inference and training, achieving real-world speed-ups in either case.

Combining Tokens. While plenty of works prune tokens, very few combine them. Kong et al. (2022) and Liang et al. (2022) combine what they prune into a single token. GroupViT (Xu et al., 2022), while not focused on efficiency, groups tokens using cross-attention for semantic segmentation.
TokenLearner (Ryoo et al., 2021) uses an MLP to reduce the number of tokens. LIT (Pan et al., 2022) learns deformable token merging layers for pooling between stages. Token Pooling (Marin et al., 2021) is the most similar to our token merging, but uses a slow k-means-based approach [1] that doesn't work on an off-the-shelf model [2]. Until now, no approach has been successful in offering a reasonable speed-accuracy trade-off when combining tokens without training.

3. TOKEN MERGING

Our goal is to insert a token merging module into an existing ViT (Dosovitskiy et al., 2020) . By merging redundant tokens, we hope to increase throughput, while not necessarily having to train.

Strategy.

In each block of the transformer, we merge tokens to reduce the count by r per layer. Note that r is a quantity of tokens, not a ratio. Over the L blocks in the network, we gradually merge rL tokens. Varying r gives a speed-accuracy trade-off: fewer tokens means lower accuracy but higher throughput. Importantly, we reduce rL tokens regardless of the image's content. Some pruning methods instead dynamically vary the number of tokens (e.g., Kong et al. (2022)); this increases accuracy but is generally impractical, as it prevents batched inference or training without padding tokens. As shown in Fig. 1, we apply our token merging step between the attention and MLP branches of each transformer block. This is in contrast to prior works, which tend to place their reduction method at the beginning of the block instead. Our placement allows information to be propagated from tokens that would be merged and enables us to use features within attention to decide what to merge, both of which increase accuracy (see Tab. 1a).

Token Similarity. Before merging similar tokens, we must first define what "similar" means. While it may be tempting to call two tokens similar if the distance between their features is small (as in Marin et al. (2021)), this is not necessarily optimal: the intermediate feature space in modern transformers is overparameterized. For instance, ViT-B/16 has enough features to completely encode the rgb pixel values of each patch.
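The strategy above can be made concrete with a small sketch. The following is a minimal, single-sample NumPy illustration, not the paper's exact matching algorithm: it assumes cosine similarity over the attention keys as the similarity metric, splits tokens into two alternating sets so each token is matched at most once (keeping the step lightweight and parallelizable), and averages each of the r most similar pairs. The function name `merge_tokens` and the unweighted pairwise average are illustrative assumptions, and token order is not preserved.

```python
import numpy as np

def merge_tokens(x: np.ndarray, keys: np.ndarray, r: int) -> np.ndarray:
    """Reduce x from N to N - r tokens by averaging similar pairs.

    x:    (N, C) token features between the attention and MLP branches.
    keys: (N, C) attention keys, used to measure token similarity.
    r:    number of tokens to remove per block (a count, not a ratio).
    """
    N, C = x.shape
    # Split tokens into two alternating sets A and B, so matching is a
    # simple A-to-B assignment rather than a full clustering problem.
    a_idx, b_idx = np.arange(0, N, 2), np.arange(1, N, 2)
    ka = keys[a_idx] / np.linalg.norm(keys[a_idx], axis=-1, keepdims=True)
    kb = keys[b_idx] / np.linalg.norm(keys[b_idx], axis=-1, keepdims=True)
    sim = ka @ kb.T                      # (|A|, |B|) cosine similarities
    best_b = sim.argmax(axis=-1)         # best B partner for each A token
    best_sim = sim.max(axis=-1)
    merge_a = np.argsort(-best_sim)[:r]  # the r most similar A tokens

    xa, xb = x[a_idx].copy(), x[b_idx].copy()
    for i in merge_a:                    # loop kept for clarity, not speed
        j = best_b[i]
        # Fold the A token into its B partner (unweighted average here;
        # ties where several A tokens share a partner resolve sequentially).
        xb[j] = (xb[j] + xa[i]) / 2
    keep = np.ones(len(a_idx), dtype=bool)
    keep[merge_a] = False                # drop the merged A tokens
    return np.concatenate([xa[keep], xb], axis=0)  # (N - r, C)
```

Because exactly r tokens are removed per call regardless of content, stacking this across L blocks removes rL tokens total and keeps every sample in a batch the same shape, which is what makes batched inference and training straightforward.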



[1] Their throughput is only 1.14-1.25× the baseline because their method can't be parallelized.
[2] In their appendix, they show drops of 10-40% accuracy when combining tokens without training.

