TOKEN MERGING: YOUR VIT BUT FASTER

Abstract

We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2× the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2× the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, improving training speed in practice by up to 2× for MAE fine-tuning on video. Training with ToMe further minimizes the accuracy drop, leading to 2× the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with the state-of-the-art on images, video, and audio.

1. INTRODUCTION

The introduction of transformers (Vaswani et al., 2017) from NLP to vision with Vision Transformers (ViTs) by Dosovitskiy et al. (2020) has rapidly advanced the field of computer vision. However, unlike in NLP, vision has since been dominated by domain-specific transformer hybrids like Swin (Liu et al., 2021; Dong et al., 2022) using vision-specific attention, MViT (Fan et al., 2021; Li et al., 2022) using vision-specific pooling, or LeViT (Graham et al., 2021) using vision-specific conv modules. The reason for this trend is simple: efficiency. Adding vision-specific inductive biases enables transformer hybrids to perform better with less compute. Yet, vanilla ViTs still have many desirable qualities: they consist of simple matrix multiplications, making them faster than their raw flop count would suggest; they support powerful self-supervised pre-training techniques such as MAE (He et al., 2022) that can produce state-of-the-art results while being fast to train; given their lack of assumptions about the data, they can be applied with little or no change across many modalities (Feichtenhofer et al., 2022; Huang et al., 2022); and they scale well with massive amounts of data (Zhai et al., 2021; Singh et al., 2022), recently obtaining up to 90.94% top-1 on ImageNet (Wortsman et al., 2022). However, running these massive models can be troublesome, and reproducing these results with a faster architecture would be difficult. A promising subfield of ViTs has recently emerged where, due to the input-agnostic nature of transformers, tokens can be pruned at runtime to enable a faster model (Rao et al., 2021; Yin et al., 2022; Meng et al., 2022; Liang et al., 2022; Kong et al., 2022).
Yet, token pruning has several disadvantages: the information loss from pruning limits how many tokens you can reasonably reduce; current methods require re-training the model to be effective (some with extra parameters); most cannot be applied to speed up training; and several prune different numbers of tokens depending on the input content, making batched inference infeasible. In this work, we present Token Merging (ToMe) to combine tokens, rather than prune them. Because of our custom matching algorithm, our method is as fast as pruning while being more accurate. Moreover, our method works with or without training, which unlocks its use on huge models with minimal accuracy drop. Using ToMe during training, we observe actual increases in training speed, in some cases cutting the total training time in half. And we apply ToMe without any modifications to images, video, and audio and find it to be competitive with the SotA in all cases. Our contributions are as follows: we introduce a technique to increase the throughput and real-world training speed of ViT models, both with and without training (Sec. 3), and thoroughly ablate our design choices (Sec. 4.1); we perform extensive experiments on images with several ViT models (Sec. 4.2) and compare to the state-of-the-art in architecture design and token pruning methods (Sec. 4.3); we then repeat these experiments for both video (Sec. 5) and audio (Sec. 6) and find ToMe works well across modalities; and we visualize our results and find ToMe merges parts of objects on images (Fig. 4) and objects over their entire range of motion on video (Fig. 6). We hope ToMe can enable the creation of more powerful, faster ViT models.

2. RELATED WORK

Efficient Transformers. Several works have attempted to create more efficient transformers in both NLP and Vision. Some focus on faster attention (Choromanski et al., 2020; Shen et al., 2021; Dao et al., 2022; Wang et al., 2020; Bolya et al., 2022), some attempt to prune heads or features (Meng et al., 2022; Voita et al., 2019; Michel et al., 2019), and some attempt to infuse domain-specific modules (Mehta & Rastegari, 2021; Graham et al., 2021; Liu et al., 2021; 2022a; Dong et al., 2022). In this paper, we focus on speeding up existing ViT models by merging tokens to match the speed-accuracy trade-off of more complicated domain-specific models, sometimes without training.

Token Reduction. Since transformers can operate with any number of tokens, several recent works have attempted to prune the tokens from transformers in both NLP (Goyal et al., 2020; Kim & Cho, 2020; Kim et al., 2021; Lassance et al., 2021) and Vision (Meng et al., 2022; Yin et al., 2022; Kong et al., 2022; Song et al., 2022; Rao et al., 2021; Fayyaz et al., 2022; Yu & Wu, 2021). However, these methods require training, while our method can be used without training. Moreover, most pruning works are dynamic, i.e., the number of tokens varies between images or sentences. While this benefits accuracy, it limits practicality, as samples with different numbers of tokens can no longer be batched. To solve this, most pruning papers apply a mask during training rather than remove tokens, which negates the speed-up from pruning. Our method, on the other hand, can be applied during both inference and training, achieving real-world speed-ups in either case.

Combining Tokens. While plenty of works prune tokens, very few combine them. Kong et al. (2022) and Liang et al. (2022) combine what they prune into a single token. GroupViT (Xu et al., 2022), while not focused on efficiency, groups tokens using cross-attention for semantic segmentation.
TokenLearner (Ryoo et al., 2021) uses an MLP to reduce the number of tokens. LIT (Pan et al., 2022) learns deformable token merging layers for pooling between stages. Token Pooling (Marin et al., 2021) is the most similar to our token merging, but uses a slow kmeans-based approach [foot_0] that doesn't work on an off-the-shelf model [foot_1]. Until now, no approach has been successful in offering a reasonable speed-accuracy trade-off when combining tokens without training.

3. TOKEN MERGING

Our goal is to insert a token merging module into an existing ViT (Dosovitskiy et al., 2020) . By merging redundant tokens, we hope to increase throughput, while not necessarily having to train.

Strategy.

In each block of a transformer, we merge tokens to reduce the count by r per layer. Note that r is a quantity of tokens, not a ratio. Over the L blocks in the network, we gradually merge rL tokens. Varying r gives a speed-accuracy trade-off, as fewer tokens means lower accuracy but higher throughput. Importantly, we reduce rL tokens regardless of the image's content. Some pruning methods dynamically vary the number of tokens (e.g., Kong et al. (2022)). This increases accuracy but is generally impractical, as it prevents batched inference or training without padding tokens. As shown in Fig. 1, we apply our token merging step between the attention and MLP branches of each transformer block. This is in contrast to prior works, which tend to place their reduction method at the beginning of the block instead. Our placement allows information to be propagated from tokens that would be merged and enables us to use features within attention to decide what to merge, both of which increase accuracy (see Tab. 1a).

Token Similarity. Before merging similar tokens, we must first define what "similar" means. While it may be tempting to call two tokens similar if the distance between their features is small (as in Marin et al. (2021)), this is not necessarily optimal. The intermediate feature space in modern transformers is overparameterized: for instance, ViT-B/16 has enough features per token (768) to completely encode the RGB values of its 16×16×3 input patch, so its features can also contain information irrelevant for measuring similarity. Luckily, transformers natively solve this problem with QKV self-attention (Vaswani et al., 2017). Specifically, the keys (K) already summarize the information contained in each token for use in dot product similarity. Thus, we use a dot product similarity metric (e.g., cosine similarity) between the keys of each token to determine which contain similar information (see Tab. 1a, 1b).

Bipartite Soft Matching. With token similarity defined, we need a fast way to determine which tokens to match in order to reduce the total number by r.
There are several potential solutions to this problem, such as kmeans clustering (Lloyd, 1982) or graph cuts (Boykov et al., 2001). But we perform this matching L times within the network on potentially thousands of tokens, so its runtime has to be absolutely negligible. This is very much not the case for most iterative clustering algorithms, so we propose a more efficient solution. Our design goals are as follows: 1) avoid anything iterative that cannot be parallelized, and 2) keep the changes merging makes gradual. The latter is why we focus on matching and not clustering: clustering places no bounds on how many tokens can be merged into one group (which may adversely affect the network), whereas matching leaves most of the tokens unmerged. Our algorithm is as follows (visualized in Fig. 1):

1. Partition the tokens into two sets A and B of roughly equal size.
2. Draw one edge from each token in A to its most similar token in B.
3. Keep the r most similar edges.
4. Merge tokens that are still connected (e.g., by averaging their features).
5. Concatenate the two sets back together.

Because this creates a bipartite graph and each token in A has only one edge, finding connected components in step 4 is trivial. Moreover, we don't need to compute similarity between every pair of tokens, which, if we choose A and B carefully, isn't a problem for accuracy (see Tab. 1e). In fact, this "bipartite soft matching" is nearly as fast as just dropping tokens randomly (see Tab. 2) and takes only a few lines of code to implement (see Appendix D).

Tracking Token Size. Once tokens are merged, they no longer represent one input patch. This can change the outcome of softmax attention: if we merge two tokens with the same key, that key has less effect in the softmax term.
We can fix this with a simple change, denoted proportional attention:

A = softmax(QK⊤/√d + log s)    (1)

where s is a row vector containing the size of each token (the number of patches the token represents). This performs the same operation as if you had s copies of each key. We also need to weight tokens by s any time they would be aggregated, such as when merging tokens together (see Tab. 1d).

(Table 1: Token Merging ablations using ViT-L/16 MAE (He et al., 2022) on ImageNet-1k, evaluated off-the-shelf without training, using r = 8. The baseline model without ToMe obtains 85.96% acc at 93.3 im/s. For each ablation, we report top-1 accuracy (acc) and fp32 model throughput (im/s) on a V100 GPU. Our default settings are marked in purple.)

Training with Merging. ToMe can also be applied during training, either to reduce the accuracy drop or to speed up training. To train, we simply treat token merging as a pooling operation and backprop through the merged tokens as if we were using average pooling. We don't find a need for any gradient tricks such as Gumbel softmax (Jang et al., 2017) as used in token pruning (e.g., Kong et al. (2022)). In fact, we find that the same settings used to train a vanilla ViT are also optimal here (see Appendix B). Thus ToMe is a drop-in replacement that increases training speed.
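To make the pieces above concrete, here is a minimal NumPy sketch of one merging step (bipartite soft matching with size-weighted averaging) plus proportional attention. This is an illustration, not the paper's actual implementation (see Appendix D for that); the function names, the unbatched single-head layout, and the Python loop are assumptions for clarity.

```python
import numpy as np

def bipartite_soft_match(k, r):
    """One merging step of bipartite soft matching (illustrative sketch).

    k: (n, d) array of attention keys, used as per-token summaries.
    r: number of tokens to remove this layer (at most n // 2).
    Returns (n - r, d) merged token features.
    """
    n = k.shape[0]
    # 1. Partition tokens into sets A and B by alternating (best per Tab. 1e).
    a, b = k[0::2], k[1::2]
    # Cosine similarity between every token in A and every token in B.
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    scores = a_n @ b_n.T                       # (|A|, |B|)
    # 2. One edge per token in A: its most similar token in B.
    best = scores.argmax(axis=1)
    best_score = scores.max(axis=1)
    # 3. Keep only the r most similar edges.
    merged_a = np.argsort(-best_score)[:r]
    # 4. Merge connected tokens by (running) average into their target in B.
    b_out = b.copy()
    counts = np.ones(len(b))                   # token "size" s
    for i in merged_a:                         # loop for clarity; vectorizable
        j = best[i]
        b_out[j] = (b_out[j] * counts[j] + a[i]) / (counts[j] + 1)
        counts[j] += 1
    # 5. Concatenate unmerged A tokens with (partially merged) B.
    keep_a = np.setdiff1d(np.arange(len(a)), merged_a)
    return np.concatenate([a[keep_a], b_out], axis=0)

def proportional_attention(q, k, s, d):
    """softmax(q k^T / sqrt(d) + log s): each key acts as s copies of itself."""
    logits = q @ k.T / np.sqrt(d) + np.log(s)[None, :]
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Note that adding log s to the logits is exactly equivalent to duplicating each key s times, which is why merged tokens keep their full weight in attention.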

4. IMAGE EXPERIMENTS

We perform several experiments on ImageNet-1k (Deng et al., 2009) using ViT models trained in four different ways: AugReg (Steiner et al., 2022) , MAE (He et al., 2022) , SWAG (Singh et al., 2022) , and DeiT (Touvron et al., 2021) . For all experiments, we either run the model off-the-shelf with our method or, in the case of MAE and DeiT, trained with our method applied. All throughputs are measured during inference on a V100 GPU with optimal batch size and fp32 unless noted otherwise.

4.1. DESIGN CHOICES

In Tab. 1, we ablate the design choices made in our approach. For each ablation, we start from our default parameters, marked in purple. Unless otherwise noted, we test on an off-the-shelf ViT-L/16 MAE model without training (acc: 85.96%, im/s: 93.3) and merge with r = 8, which gradually removes 98% of tokens over the 24 layers of the network.

Token Similarity. The tokens' features (X) are not the best similarity metric in terms of performance (Tab. 1a). Moving the merging operation after attention (X vs. Xpre) and using the attention keys (K) is significantly more accurate. Cosine similarity is the best measure of token distance, as shown in Tab. 1b. Finally, we average K over the attention heads instead of concatenating them (Tab. 1c) for efficiency.
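As a small illustration of this metric, head-averaged key similarity might look as follows; the shape convention and function name are assumptions for this sketch, not the paper's code:

```python
import numpy as np

def token_similarity(keys):
    """Cosine similarity between tokens from multi-head attention keys.

    keys: (heads, n, head_dim) — the K matrix of one attention layer.
    Averaging over heads (rather than concatenating) keeps the metric cheap.
    Returns an (n, n) cosine-similarity matrix.
    """
    k = keys.mean(axis=0)                             # (n, head_dim)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)  # unit-normalize rows
    return k @ k.T
```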

Algorithmic Choices.

After deciding which tokens to merge, we combine them by averaging weighted by token size, s (see Eq. 1). In Tab. 1d, this outperforms keeping just the token in B, max pooling, or unweighted average pooling. Our bipartite matching algorithm then requires splitting the input into two disjoint sets. Because we concatenate the sets afterward, we find that assigning tokens by alternating works best (Tab. 1e). Filling A and then filling B (sequentially) performs the worst.

Proportional Attention. Once merged, tokens can represent more than one input patch. We address this with proportional attention (Eq. 1), which we ablate in Tab. 1f. Surprisingly, proportional attention is necessary for supervised models (e.g., AugReg, SWAG, DeiT), but not for MAE models.

Comparing Matching Algorithms. In Tab. 2, we compare our bipartite matching to different token reduction algorithms, both pruning and merging. Pruning is fast, but with 98% of the tokens removed overall, important information is lost. This is true both for pruning randomly and for pruning based on what isn't attended to (Kim et al., 2021). In contrast, merging tokens only loses information when dissimilar tokens are merged. Thus, it's important to correctly select similar tokens to merge. At first, kmeans (Lloyd, 1982) may seem like the obvious choice, but on top of being slow, it's only slightly better than pruning. While it may minimize reconstruction error, kmeans allows a large number of tokens to match to the same cluster, which increases the probability of dissimilar tokens being merged. Marin et al. (2021) study several faster clustering algorithms based on kmeans, but they are unable to obtain better than a 10% accuracy drop in their setting without training. Instead, we want a matching algorithm that only merges the most similar tokens. We could do this greedily by merging the most similar pair of tokens and then repeating without replacement r times.
This is accurate but sequential and thus could get slow with large r. Our bipartite matching has the accuracy of this greedy approach and the speed of pruning, while having constant runtime w.r.t. r.

Selecting a Merging Schedule. By default, we merge tokens with a constant schedule, i.e., r per layer. To evaluate the optimality of this design, we randomly sample a total of 15,000 merging schedules. For each schedule, we test its accuracy and fp16 throughput on ImageNet-1k val using an off-the-shelf AugReg ViT-B/16 model. In Fig. 2, we plot the results of this experiment and find that a constant schedule is close to optimal, especially as the total number of tokens merged increases. We further analyze the best random samples (see Appendix C) and find that a linearly decreasing schedule works well at throughputs up to ∼3×. Thus, we also define a "decreasing" schedule that removes 2r tokens in the first layer and 0 tokens in the last layer, linearly interpolating for the rest. This also removes rL tokens, but is faster because more are removed early:

Constant schedule: x tokens per layer, denoted rx➙    (2)
Decreasing schedule: 2x → 0 tokens per layer, denoted rx➘    (3)

4.2. MODEL SWEEP

In Fig. 3, we apply our token merging method to 11 SotA off-the-shelf ViT models from various sources. For each model, we vary r with a constant schedule to construct throughput vs. accuracy curves, starting from r = 0 (the no-merging baseline) up to an r at which we run out of tokens to merge. We evaluate each model off-the-shelf: by default we don't train; we just download the model and change a few lines of code. Models are evaluated on 224px images unless otherwise noted.

Supervised Models. Both AugReg (Steiner et al., 2022) and SWAG (Singh et al., 2022) are ViT models pretrained on a large supervised (or weakly supervised) pretraining dataset and fine-tuned on ImageNet-1k.
AugReg covers optimal ViT training up until ViT-L/16, while SWAG pushes the limits of ImageNet by training huge models with massive image sizes. We apply our method off-the-shelf on AugReg models in Fig. 3a and SWAG models in Fig. 3b.

(Fig. 3 panels: (a) AugReg models, pretrained on ImageNet-21k (Steiner et al., 2022); (b) SWAG models, massive weakly supervised models pretrained on 3.6B images (Singh et al., 2022); (c) MAE models, self-supervised models pretrained on ImageNet-1k (He et al., 2022).)

Immediately, we can see that a constant schedule gives up to 2× the throughput no matter the model. And even though we're compressing 96-98% of the tokens in each, the largest models have barely any accuracy drop: while ViT-B, S, and Ti all have around a 4-5% accuracy drop at 2× speed, ViT-L only suffers a 2% drop on 224px images and a 0.7% drop on 384px images with AugReg. Similarly, with SWAG models, ViT-L on 512px images and ViT-H on 518px images both have a small 0.3% accuracy drop without training. Note this trend is not just because larger models have more tokens, since we always reduce the number of tokens by 96-98%. Instead, we think this is because large models are deeper and thus allow for more gradual change in features, which lessens the impact of merging.

Self-Supervised Models. MAE (He et al., 2022) is a self-supervised pretraining method for ViT, with models pretrained and fine-tuned on ImageNet-1k. In Fig. 3c we apply our method both off-the-shelf and trained by fine-tuning the public pretrained checkpoints. When fine-tuning, we find that we can use the original training recipes: we don't have to compensate for fewer tokens in later layers (see Appendix B), likely because our method is already tuned to imitate a model without merging.

Comparison to State of the Art. In Tab.
3, we compare our MAE fine-tuned models to state-of-the-art models trained on ImageNet-1k without extra data: EfficientNet (Tan & Le, 2019), Swin (Liu et al., 2021), SwinV2 (Liu et al., 2022a), CSWin (Dong et al., 2022), and MViTv2 (Li et al., 2022). All throughputs are on a single V100. Note that we use MAE pretraining, which is not supported for all transformers but provides an accuracy improvement for some like Swin/SwinV2; thus we also include SimMIM (Xie et al., 2022) pre-trained Swin and SwinV2 models for comparison. Nevertheless, token merging with ToMe improves the throughput of ViT models such that ViT-L and ViT-H become comparable in speed to models of a lower tier, without scaling the number of features. Thus we display results of ViT "advancing a tier" in Tab. 3. More testing is needed to see whether applying ToMe and model scaling at the same time would produce even better results.

Comparison to Token Pruning. In Tab. 4, we compare ToMe to token pruning works that use DeiT-S training [foot_2]: A-ViT (Yin et al., 2022), DynamicViT (Rao et al., 2021), and SP-ViT (Kong et al., 2022), with throughput measured on a V100. Even though we don't use gradient tricks such as Gumbel softmax (Jang et al., 2017), add extra parameters, or use additional training tricks, we can already match the performance and exceed the throughput of existing, much more complicated token pruning works. Moreover, most token pruning works are forced to use padding tokens or attention masking during training, negating the benefits of pruning in the first place. Our method, on the other hand, doesn't suffer from this issue, and we observe a 1.5× training speedup with DeiT. Interestingly, after 300 epochs the DeiT models have a similar accuracy drop to our MAE-trained ViT-L (Appendix A). But we actually don't need to train at all: if we take an off-the-shelf AugReg ViT-S model and apply the same merging schedule, we can match the performance of the DeiT models without training.

4.4. VISUALIZATIONS

In Fig. 4, we show the input patches belonging to each merged token at the end of the network. We find that applying ToMe results in token merging that resembles part segmentation (Chen et al., 2014). In the second image, the husky has different tokens for its legs, body, and face. The monkey in the third image has different tokens for its hand, body, face, eyes, and mouth, while the orange it's holding gets its own token despite representing just one input patch. In cases where there are multiple instances of the same class, like the dung beetles in the fourth image and the Boston terriers in the last image, the same parts from all instances get merged together. Notably, unlike pruning, ToMe is able to merge a large number of tokens both in the background and the foreground without losing information. See more results and methodology in Appendix E.
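Producing such visualizations requires knowing which input patches ended up inside each final token. Appendix E describes the paper's own methodology; one simple way to track this (an assumed sketch, not the paper's code) is to carry a patch-to-token assignment through every merge step:

```python
def update_assignment(assign, targets):
    """Update which token each input patch belongs to after one merge step.

    assign: list where assign[p] is the current token index of patch p.
    targets: dict mapping each merged-away token to the token it joins.
             (In bipartite matching, merged tokens are in A and targets in B,
             so no chains of redirects occur within one step.)
    Returns the new assignment, with surviving tokens renumbered 0..m-1.
    """
    # Redirect the patches of merged tokens to their merge targets.
    redirected = [targets.get(t, t) for t in assign]
    # Renumber surviving token indices compactly.
    renumber = {t: i for i, t in enumerate(sorted(set(redirected)))}
    return [renumber[t] for t in redirected]
```

After the last block, patches sharing a final index are drawn with the same color, as in Fig. 4.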

5. VIDEO EXPERIMENTS

This framework of MAE plus token merging is a powerful strategy across several domains. Because of its high redundancy, one of the most promising is video. Thus, we apply our token merging approach to Spatiotemporal MAE (Feichtenhofer et al., 2022) for video classification on Kinetics-400 (Kay et al., 2017), both by simply applying our method off-the-shelf without training and by applying it during MAE fine-tuning with the default training recipe, as we did for images. Note that nothing in our method needs to change for video: we use the same code for both.

Results. In Tab. 5, we show the results of applying our method off-the-shelf and during MAE fine-tuning using ViT-L from Spatiotemporal MAE, compared to the relevant state-of-the-art on Kinetics-400 classification: Swin (Liu et al., 2022b) pretrained on ImageNet-21k, MViTv2 (Li et al., 2022) pretrained with MaskFeats (Wei et al., 2022), and Spatiotemporal MAE as the baseline. We also include a token pruning work, X-ViT + ATS (Fayyaz et al., 2022), for completeness. Amazingly, ToMe applied to ViT-L with a constant schedule can match the throughput of Swin-B while performing better than MViTv2-L, even when evaluated without training. Moreover, with a decreasing schedule, ViT-L MAE r65➘ significantly outperforms the baseline ViT-B MAE model with the same flop count, with or without training, meaning ToMe is better than model scaling here. Training is not necessary for a constant schedule, but it does help with a decreasing schedule.

Throughput. In Tab. 6, we display the throughput and training time of our method applied to ViT-L. With a constant schedule, we can increase throughput by 2.2× for a negligible 0.2% accuracy drop. Moreover, this setting cuts training time in half, even with the overhead of syncing across 8 GPUs.

Clip Count. Because each forward pass only sees up to 2 seconds of video, it's standard practice to evaluate video recognition models with multiple clips. In Tab.
5, we evaluate with multiple clips (1 spatial crop, 10 temporal crops). We don't factor the number of clips into the flop count, because this is a hyperparameter every method can tune, usually resulting in only small differences as long as a minimum number of clips is used (i.e., 4 in this case). Thus, we choose the same number of clips as other models for comparison. However, multiple clips might compensate for the information loss from token merging. In Fig. 5, we test whether this is the case by sweeping over the number of clips for our method compared to the baseline ViT-L model. For r = 65, we see some degradation compared to the 4-clip sweet spot (∼0.5%), but for lower r values, there's no decrease compared to the baseline.

Visualization. We visualize the final tokens for each input patch over multiple frames of video in Fig. 6 using our trained ViT-L MAE r65➙ model. Just as ToMe performs primitive part segmentation on images, it is actually able to perform primitive part tracking on video: the same object or part is merged into one token across multiple frames, like the ball in Fig. 6. Note that the extraneous red patch in the third frame is the reflection of the ball in the glass. More results are in Appendix E.

B TRAINING HYPERPARAMETERS

Drop path randomly drops out entire attention and MLP blocks with some probability. This has the effect of regularizing layers so that they don't rely on a single block. Because we use the K matrix from blocks that could be dropped out, we test the value of this parameter. Again, we find it unnecessary to change. We also perform the same experiments on video, except with just layer decay and the number of epochs, testing whether ToMe requires increasing the number of epochs (due to seeing fewer tokens overall). And again, the default parameters work best.

(b) Video Fine-Tuning. We test some additional parameters for ViT-L spatiotemporal MAE fine-tuning on K400 with r = 65. The defaults also work best here. Accuracy is for 3×5 evaluation.
Note that this fine-tuning started from an MAE pre-trained model trained for half its schedule, so these numbers aren't comparable to the numbers in Tab. 5. Table 14: Training hyperparameters don't need to be updated when training with token merging. We perform a sweep over relevant hyperparameters that might be affected by token merging, for images and video. The default setting, marked in purple, already has the highest accuracy.

C MERGING SCHEDULE

Figure 7: Merging Schedule. The average of the top 100 merging schedules for the experiment in Fig. 2. For each throughput range, we find the highest accuracy schedules and average their number of tokens merged per layer. For lower throughputs, merging more tokens at the end is better. For higher throughputs, a constant schedule becomes the best. Then for even higher throughputs, a linearly decreasing schedule works well.

In Fig. 7, we plot the average number of tokens merged in each layer for the most accurate random samples in Fig. 2. Around throughputs of 1600-1800, the best schedule is close to constant, which is why constant is close to optimal in this range. For throughputs beyond that, however, a decreasing schedule is best. For this reason, we define a linearly decreasing schedule in addition to a constant schedule in the main paper.
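The two schedules from the main paper (Eq. 2 and 3) can be written out explicitly. This is an illustrative sketch; the rounding to whole tokens in the decreasing case is an assumption, since the paper doesn't specify it:

```python
def merge_schedule(r, num_layers, decreasing=False):
    """Number of tokens merged in each layer (assumes num_layers >= 2).

    Constant schedule: r per layer, removing r * num_layers tokens total.
    Decreasing schedule: 2r in the first layer, linearly down to 0 in the
    last; this removes the same total but is faster, since more tokens are
    removed early. Rounding may shift the total by a token or two in general.
    """
    if not decreasing:
        return [r] * num_layers
    return [round(2 * r * (num_layers - 1 - l) / (num_layers - 1))
            for l in range(num_layers)]
```

For example, the r8➙ and r8➘ settings on a 24-layer ViT-L both remove 192 of the 197 tokens overall.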



[foot_0] Their throughput is only 1.14-1.25× the baseline because their method can't be parallelized.
[foot_1] In their appendix, they show drops of 10-40% accuracy when combining tokens without training.
[foot_2] This comparison is difficult, as many token pruning works use different training strategies, some even claiming improvement in accuracy without a valid baseline. A-ViT fine-tunes on top of DeiT, while DynamicViT starts DeiT training from an existing checkpoint. We, on the other hand, train from scratch.



Figure 1: Token Merging. (a) With ToMe, similar patches are merged in each transformer block: for example, the dog's fur is merged into a single token. (b) ToMe is simple and can be inserted inside the standard transformer block. (c) Our fast merging algorithm, see Appendix D for implementation.

Figure 2: Token Merging Schedule. Our default constant merging schedule is close to optimal when compared to 15k randomly sampled merging schedules on an AugReg ViT-B/16.

Figure 3: Model Sweep. We apply ToMe to several state-of-the-art ViT models off-the-shelf, i.e., without training, varying r to produce fp32 throughput vs. accuracy curves on ImageNet-1k.

Figure 4: Image Visualizations. Results of merging on ImageNet-1k val using a ViT-H MAE r7➙ model trained with ToMe. Patches with the same inner and border color are merged together. Unlike pruning, ToMe can merge similar parts of the image whether they're in the foreground or background.

Figure 8: More visualization on images. Continuation of Fig. 4.

Figure 9: More visualization on video. Continuation of Fig. 6. In each clip, we highlight an instance or part being merged into one token across frames (red). Clips are from Kinetics-400 val.

Token Merging ablation experiments using ViT-L/16 from MAE

Matching Algorithm. Different matching algorithms with the same settings as Tab. 1. Our bipartite algorithm is almost as fast as randomly pruning tokens, while retaining high accuracy. Matching is more suited for this setup than pruning or clustering.

Advancing a Tier. Comparison to SoTA models trained only on ImageNet-1k. Our method allows for the use of more complicated models for the same tier of throughput.

ToMe vs. pruning methods on ViT-S trained from scratch with DeiT. We time DynamicViT on our V100, but the others cannot be batched and evaluate throughput in different settings. Pruning methods require token padding during training and thus don't improve training speed, while ToMe at r = 13 trains 1.5× faster than the baseline DeiT-S. We also obtain the same result without training on an off-the-shelf AugReg ViT-S model. Blue indicates ToMe applied without training, while gray indicates ToMe applied during training. † A-ViT is not trained from scratch and performs slightly better than DynamicViT in its setting. See Appendix A for DeiT-Ti results.

Full Video Off-the-Shelf Results. Results are without training. The original model is listed in gray. We include the blue models in Tab. 5. Evaluation is 1 × 10 (this is not factored into the flop count). We include both top-1 and top-5 accuracy on Kinetics-400.

Full Audio Results. Results are with and without training. The original baseline model (left) and the baseline we train (right) are listed in gray. We include the blue and gray models in Tab. 7.

Image Fine-Tuning. We sweep over ViT-B/16 MAE fine-tuning on ImageNet-1k with r = 16 and find that the default parameters from the official code release work the best.


And as expected, in Fig. 3c, we see the same trends as before, except this time, with training, we can bring the error down to 0.4% for ViT-H, 0.6% for ViT-L, and 1.7% for ViT-B at 2× throughput. Our approach actually implicitly addresses an issue in MAE: because MAE removes tokens during pretraining, its epoch times are ∼4× faster than training a supervised model. However, normal fine-tuning uses all the tokens and doesn't have this benefit. Our token merging method fixes this issue and allows for ∼2× faster epochs at negligible accuracy drops for large models. This suggests that one could train even bigger models with token merging than was possible before.

Re-evaluating. Note that, while in Fig. 3c we train a new model for each value of r, this isn't actually necessary. Instead, we can take a model trained with one value of r and re-evaluate it with another. In fact, it's possible to actually improve performance by doing so. For instance, the baseline ViT-L model we train in Fig. 3c gets 85.7% accuracy. If we re-evaluate our r = 5 trained model with r = 0, we obtain 85.8% accuracy. Thus, it's feasible to speed up training with ToMe and not apply it during evaluation to produce the same or better results. This means that, while the results of applying ToMe in Fig. 3b and Fig. 3c are similar to, e.g., scaling the model size, you only have to train one model with ToMe to create models at a large range of scales.

4.3. COMPARISON TO OTHER WORKS

In this section, we compare our trained token merging models to other state-of-the-art works on ImageNet-1k, both against the overall vision space and against other token reduction methods.

6. AUDIO EXPERIMENTS

We perform experiments on an Audio MAE (Huang et al., 2022), where a spectrogram of the audio signal is rasterized and then fed into a ViT model. We use the ViT-B model from Huang et al. (2022) and evaluate on AudioSet-2M (Gemmeke et al., 2017).

Results. Note that the metric reported is mAP instead of accuracy due to class imbalance. Owing to training implementation differences, the baseline model we train has a lower mAP than originally reported in Huang et al. (2022). Thus, in Tab. 7, we compare ToMe without training to the original number, and ToMe with training to our trained baseline. Regardless, on audio we obtain an almost 2× throughput increase with an mAP drop of only 0.4%. Full results for this experiment are in Appendix A.
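Since the spectrogram is rasterized like an image, the token count follows directly from its dimensions. A back-of-envelope sketch (the example spectrogram size below is an illustrative assumption, not a number from the paper):

```python
# Illustrative sketch: a T x F mel spectrogram split into p x p patches yields
# (T // p) * (F // p) tokens, exactly as with image patches.
def spectrogram_tokens(frames: int, mel_bins: int, patch: int = 16) -> int:
    return (frames // patch) * (mel_bins // patch)

spectrogram_tokens(1024, 128)  # 64 * 8 = 512 tokens for a 1024x128 spectrogram
```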

7. CONCLUSION

In this work, we introduced Token Merging (ToMe) to increase the throughput of ViT models by gradually merging tokens. ToMe naturally exploits redundancy in the input, allowing its use on any modality with redundancy. Through extensive experiments on images, video, and audio, we obtain speeds and accuracies competitive with the state-of-the-art in each case.

ToMe can be viewed as a "natural" hierarchical model, similar to Swin or MViT but built from pure transformer blocks, and it could be combined with those methods to create an entirely new type of architecture. Similarly, we focus on classification, but our visualizations show potential on tasks like segmentation. Finally, ToMe works well on large models across domains and cuts down training time and memory usage, meaning it could be a core component of training huge models. We leave these as topics for future work and hope ToMe can lead to the creation of better, more efficient transformers.

A FULL RESULTS

Results for plots and tables in the main paper. For all results, im/s indicates throughput and "speed" indicates improvement over the baseline. All throughputs are measured on a V100, but the exact values may differ slightly from the main paper, as a model may have been benchmarked on a different machine. However, all results in the main paper use the same machine for throughput evaluation.
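A minimal sketch of how such an im/s number can be measured (this is our illustration of standard practice, not the paper's exact harness): warm up first, synchronize around the timed region when a GPU is involved, and divide images processed by elapsed time.

```python
import time

import torch


@torch.no_grad()
def images_per_second(model, batch, warmup=10, iters=50):
    """Rough throughput benchmark (im/s). Sketch only, not the paper's harness."""
    model.eval()
    for _ in range(warmup):  # warm up kernels and caches before timing
        model(batch)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # don't start the clock with work in flight
    start = time.perf_counter()
    for _ in range(iters):
        model(batch)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for async GPU work before stopping
    return iters * batch.shape[0] / (time.perf_counter() - start)
```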

A.1 IMAGES

For each ImageNet-1k model, we display our full results here.

A.1.1 AUGREG MODELS

Full results listed in Tab. 8. We make no special changes to any of these models. The original off-the-shelf models are listed in gray. We mention the blue models in the abstract.

A.1.3 MAE MODELS

We evaluate MAE models both off the shelf and trained with ToMe in Tab. 10. For off-the-shelf evaluation we disable proportional attention as noted in Sec. 4.1, but we enable it for the trained models. Note that we compare to baselines we trained ourselves, which may slightly underperform the official baselines (for ViT-L). When training, we fine-tune from the official pretrained weights and use the original training recipe. Unlike prior work, we intend for ToMe to replace standard training, not augment it, in order to receive the benefits of faster training times and less memory usage.

A.1.4 DEIT MODELS

We present DeiT results in Tab. 11. For DeiT, we train from scratch with the default training recipe for 300 epochs. Unlike other token pruning works, we don't use any tricks such as starting from an existing checkpoint or fine-tuning. Note that for merging, in addition to not merging the class token, we don't merge the distillation token. In Tab. 11, we don't train for all values of r, just the baseline r = 0 and those between 8 and 16. The r = 11 run for DeiT-S did not finish training.
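One simple way to keep special tokens out of the merge is to mask their matching scores before edges are picked, which is the approach sketched below (a sketch under our assumptions about the layout: with an alternating bipartite split, [CLS] at index 0 lands in row 0 of the source set and the distillation token at index 1 lands in column 0 of the destination set):

```python
import torch

# Sketch: protect special tokens during bipartite matching. Assuming the
# even-indexed tokens form the merge-source set A and the odd-indexed tokens
# form the merge-destination set B, [CLS] (token 0) is row 0 of A and the
# distillation token (token 1) is column 0 of B. Setting those scores to -inf
# means CLS is never merged away and nothing is ever merged into DIST.
scores = torch.ones(1, 4, 4)        # [batch, |A|, |B|] similarity scores
scores[..., 0, :] = -float("inf")   # CLS can never be chosen as a merge source
scores[..., :, 0] = -float("inf")   # DIST can never absorb another token
```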

A.2 VIDEO

We run the ViT-L model from Feichtenhofer et al. (2022) off the shelf. In Tab. 12, we show the results of this experiment by sweeping over r. For each setting, we evaluate with 1 spatial crop and 10 temporal clips. Note that the original baseline is evaluated with 3 spatial crops and 7 temporal clips, while we re-evaluate it with 1 × 10. Thus, the baseline has slightly lower accuracy than the original paper. As with images, for these off-the-shelf MAE pretrained models we don't use proportional attention. The original baseline model is listed in gray. We include the gray models in Tab. 3.

A.3 AUDIO

Full results for our audio experiments can be found in Tab. 13. We use the model from Huang et al. (2022) to evaluate off the shelf. For training, however, we use our own implementation, which differs from that of the original paper. For this reason, in Tab. 13 we list two different baselines (one from the original paper, and one trained by us). In this case, we don't use proportional attention during off-the-shelf evaluation or training.

B HYPERPARAMETERS

In Tab. 14, we perform a limited hyperparameter search on parameters that would be affected by applying ToMe: layer decay, drop path, and the number of epochs.

Layer decay reduces the learning rate based on the layer of the network. Since ToMe gradually reduces the number of tokens, the size of gradient updates in later layers might already be lower without layer decay. However, we find that it's not necessary to change this parameter.
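For reference, layer-wise decay is typically implemented as a per-layer learning-rate multiplier. A minimal sketch (function name is ours; conventions follow the common recipe where the final block trains at the full base rate):

```python
# Sketch of layer-wise lr decay: block i of L gets base_lr * decay**(L - i),
# so earlier layers take smaller gradient steps. Helper name is our own.
def layer_lr_scales(num_layers: int, decay: float) -> list:
    return [decay ** (num_layers - i) for i in range(num_layers + 1)]

layer_lr_scales(2, 0.5)  # [0.25, 0.5, 1.0]: deepest block keeps the full rate
```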

D IMPLEMENTATION

The following is an implementation of our "bipartite soft matching" in PyTorch (Paszke et al., 2019). It returns a function that can be applied to any matrix or vector (e.g., to merge features, to calculate token size, or to calculate source patches). Note how everything is done at once in parallel; there are no sequential loops.
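A sketch consistent with the algorithm as described (this version omits the class/distillation-token protection discussed for DeiT; "mean" merging is one choice of reduction):

```python
import torch


def bipartite_soft_matching(metric: torch.Tensor, r: int):
    """Sketch of bipartite soft matching.

    metric: [batch, tokens, channels] features used for similarity (e.g. keys).
    r: number of tokens to remove by merging.
    Returns a merge function applicable to any [batch, tokens, channels] tensor.
    """
    with torch.no_grad():
        metric = metric / metric.norm(dim=-1, keepdim=True)   # cosine similarity
        a, b = metric[..., ::2, :], metric[..., 1::2, :]      # alternate split
        scores = a @ b.transpose(-1, -2)                      # [batch, |A|, |B|]

        node_max, node_idx = scores.max(dim=-1)               # best B match per A token
        edge_idx = node_max.argsort(dim=-1, descending=True)[..., None]

        unm_idx = edge_idx[..., r:, :]                        # A tokens left unmerged
        src_idx = edge_idx[..., :r, :]                        # A tokens merged away
        dst_idx = node_idx[..., None].gather(dim=-2, index=src_idx)

    def merge(x: torch.Tensor) -> torch.Tensor:
        src, dst = x[..., ::2, :], x[..., 1::2, :]
        n, t1, c = src.shape
        unm = src.gather(dim=-2, index=unm_idx.expand(n, t1 - r, c))
        s = src.gather(dim=-2, index=src_idx.expand(n, r, c))
        # Average each merged source into its destination token, in parallel.
        dst = dst.scatter_reduce(-2, dst_idx.expand(n, r, c), s, reduce="mean")
        return torch.cat([unm, dst], dim=-2)

    return merge
```

Applying the returned function to a `[batch, N, channels]` tensor yields `[batch, N - r, channels]`; applying it to a ones vector of token sizes tracks how many patches each merged token represents.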

E MORE VISUALIZATION

To create the visualizations in Fig. 4 and Fig. 6, we follow each final merged token back to its original input patches. Then, for each token, we color its input patches with the average color in that region. To make sure different tokens are visually distinct from each other, we also assign each token a random border color. Note that tokens do not necessarily represent contiguous input regions; the only spatial signal ToMe has comes from the position encodings.

In Fig. 8, we present several more examples of merging on images as a continuation of Fig. 4. ToMe's propensity for part and object segmentation appears time and time again across many different images.

In Fig. 9, we also display more results of ToMe performing object tracking on video. Note that in Feichtenhofer et al. (2022), each token represents more than one frame: the patch size is 2 × 16 × 16, so each token corresponds to 2 frames of video. We plot the first frame of the two, because we find it more closely matches the merged tokens.
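The bookkeeping of following tokens back to their input patches can be sketched with a one-hot "source" matrix that undergoes the same merges as the features (the specific merge step below, token 3 into token 1, is invented purely for illustration):

```python
import torch

# Toy sketch of token-to-patch tracking: start from an identity "source"
# matrix and replay the merges applied to the features. Each surviving row
# then records exactly which input patches that token covers.
n = 8
source = torch.eye(n)                          # [tokens, patches], one-hot at start
source[1] += source[3]                         # illustrative: token 3 merged into token 1
source = torch.cat([source[:3], source[4:]])   # drop the merged-away row
covered = source[1].nonzero().flatten()        # input patches behind token 1
```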

