TOKEN MERGING: YOUR VIT BUT FASTER

Abstract

We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2× the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2× the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, improving in practice training speed up to 2× for MAE fine-tuning on video. Training with ToMe further minimizes accuracy drop, leading to 2× the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with state-of-the-art on images, video, and audio.

1. INTRODUCTION

The introduction of transformers (Vaswani et al., 2017) from NLP to vision with Vision Transformers (ViTs) by Dosovitskiy et al. (2020) has rapidly advanced the field of computer vision. However, unlike in NLP, vision has since been dominated by domain-specific transformer hybrids such as Swin (Liu et al., 2021; Dong et al., 2022) using vision-specific attention, MViT (Fan et al., 2021; Li et al., 2022) using vision-specific pooling, or LeViT (Graham et al., 2021) using vision-specific conv modules. The reason for this trend is simple: efficiency. Adding vision-specific inductive biases enables transformer hybrids to perform better with less compute. Yet, vanilla ViTs still have many desirable qualities: they consist of simple matrix multiplications, making them faster than their raw flop count would suggest; they support powerful self-supervised pre-training techniques such as MAE (He et al., 2022) that can produce state-of-the-art results while being fast to train; given their lack of assumptions about the data, they can be applied with little or no change across many modalities (Feichtenhofer et al., 2022; Huang et al., 2022); and they scale well with massive amounts of data (Zhai et al., 2021; Singh et al., 2022), recently obtaining up to 90.94% top-1 on ImageNet (Wortsman et al., 2022). However, running these massive models can be troublesome, and reproducing these results with a faster architecture would be difficult. A promising subfield of ViTs has recently emerged where, due to the input-agnostic nature of transformers, tokens can be pruned at runtime to enable a faster model (Rao et al., 2021; Yin et al., 2022; Meng et al., 2022; Liang et al., 2022; Kong et al., 2022).
Yet, token pruning has several disadvantages: the information loss from pruning limits how many tokens can reasonably be reduced; current methods require re-training the model to be effective (some with extra parameters); most cannot be applied to speed up training; and several prune different numbers of tokens depending on the input content, making batched inference infeasible. In this work, we present Token Merging (ToMe) to combine tokens, rather than prune them. Thanks to our custom matching algorithm, our method is as fast as pruning while being more accurate. Moreover, our method works with or without training, which unlocks its use on huge models with minimal accuracy drop. Using ToMe during training, we observe actual increases in training speed, in some cases cutting the total training time in half. Finally, we apply ToMe without any modifications to images, video, and audio, and find it to be competitive with the state of the art in all cases. Our contributions are as follows: we introduce a technique to increase the throughput and real-world training speed of ViT models, both with and without training (Sec. 3) and thoroughly ablate our
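To make the contrast with pruning concrete, the following is a minimal NumPy sketch of the general idea of merging similar tokens via a lightweight bipartite matching, not the paper's actual implementation (which is described in Sec. 3): the function name, the alternating partition, and the simple sequential averaging of duplicate matches are our own illustrative simplifications.

```python
import numpy as np

def bipartite_soft_matching_merge(tokens: np.ndarray, r: int) -> np.ndarray:
    """Illustrative sketch: merge the r most similar token pairs.

    tokens: (N, C) array of token features. The tokens are split into two
    alternating sets A and B; each token in A is matched to its most
    similar token in B by cosine similarity; the r highest-scoring pairs
    are averaged together, reducing N tokens to N - r. Unlike pruning,
    no token's information is discarded outright.
    """
    a, b = tokens[::2], tokens[1::2]                      # alternating partition
    a_n = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=-1, keepdims=True)
    scores = a_n @ b_n.T                                  # (|A|, |B|) cosine similarities
    best_match = scores.argmax(axis=-1)                   # best partner in B for each A token
    best_score = scores.max(axis=-1)
    order = np.argsort(-best_score)                       # most similar pairs first
    merged, unmerged = order[:r], order[r:]

    dst = b.copy()
    for i in merged:                                      # fold each merged A token into its match
        j = best_match[i]
        dst[j] = (dst[j] + a[i]) / 2                      # simplification: plain pairwise average
    return np.concatenate([a[unmerged], dst], axis=0)     # (N - r, C)
```

Because r pairs are merged in every call regardless of image content, the output size is fixed, which is what keeps batched inference straightforward compared with content-dependent pruning.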

