COMPOSING ENSEMBLES OF PRE-TRAINED MODELS VIA ITERATIVE CONSENSUS

Abstract

Large pre-trained models exhibit distinct and complementary capabilities dependent on the data they are trained on. Language models such as GPT-3 are capable of textual reasoning but cannot understand visual information, while vision models such as DALL-E can generate photorealistic images but fail to understand complex language descriptions. In this work, we propose a unified framework for composing ensembles of different pre-trained models, combining the strengths of each individual model to solve various multimodal problems in a zero-shot manner. We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization. The generator constructs proposals and the scorers iteratively provide feedback to refine the generated result. Such closed-loop communication enables models to correct errors caused by other models, significantly boosting performance on downstream tasks, e.g. improving accuracy on grade school math problems by 7.5%, without requiring any model finetuning. We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer, by leveraging the strengths of each expert model. Results show that the proposed method can be used as a general purpose framework for a wide range of zero-shot multimodal tasks, such as image generation, video question answering, mathematical reasoning, and robotic manipulation.

1. INTRODUCTION

Large pre-trained models have shown remarkable zero-shot generalization abilities, ranging from zero-shot image generation and natural language processing to machine reasoning and action planning. Such models are trained on large datasets scoured from the internet, often consisting of billions of datapoints. Individual pre-trained models capture different aspects of knowledge on the internet, with language models (LMs) capturing textual information in news, articles, and Wikipedia pages, and visual-language models (VLMs) modeling the alignments between visual and textual information. While it is desirable to have a single sizable pre-trained model capturing all possible modalities of data on the internet, such a comprehensive model is challenging to obtain and maintain, requiring intensive memory, an enormous amount of energy, months of training time, and millions of dollars. A more scalable alternative approach is to compose different pre-trained models together, leveraging the knowledge from different expert models to solve complex multimodal tasks.

Building a unified framework for composing multiple models is challenging. Prior works (Alayrac et al., 2022; Zeng et al., 2022) have explored composing pre-trained models in two main ways: (jointly) finetuning models on large datasets, or using common interfaces such as language to combine different models. However, these works have several key limitations. First, simply combining models does not fully utilize each pre-trained model, as there is no closed-loop feedback between models. Cascading models, such as Socratic models (Zeng et al., 2022), allows one-way communication but prevents information processed by later models from propagating back to earlier models to correct errors. Secondly, common interfaces are limited to particular types of models. Language is used as the intermediate connection in Socratic models (Zeng et al., 2022), but a language interface is insufficient to solve many real-world tasks, such as continuous robot control, which requires continuous representations. In addition, Socratic models require pre-designed language templates for the communication between models, which limits scalability. Thirdly, jointly finetuning multiple models (Alayrac et al., 2022) requires careful optimization to ensure that the model behaviors remain stable. Such models also require intensive memory and large datasets and can only be used for solving specific tasks.

[Figure: a generator (G) and scorers (E) composed via iterative consensus. Panels: Image; Video Question Answering (Q: How to make the food step by step? A: Put water in the pot, …, add sausage, add seasoning on top of the pizza … Q: What food is being made? A: Make pizza); Grade School Math (Q: A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take? A: 3. Q: Claire makes a 3 egg omelet every morning for breakfast. How many dozens of eggs will she eat in 4 weeks? A: 7).]

To resolve these difficulties, we propose a unified framework to compose models in a zero-shot manner without any training/finetuning. Our framework employs a single model as a generator and an ensemble of scorers. The generator iteratively generates proposals, and each scorer provides a feedback score indicating its agreement. The generator refines its outputs until all the scorers achieve a final consensus. This iterative closed-loop communication between the generator and scorers enables models to correct the errors caused by other models, substantially boosting performance.

The ensemble of scorers is inspired by the idea of "wisdom of the crowds". Each scorer provides complementary feedback to the generator, compensating for the potential weaknesses of other scorers. A vision-language scorer, for example, may correct the biases of a language model. We notice that different pre-trained model instances from the same family produce diverse outputs, which leads to more robust scorers. We demonstrate that guiding the generator with such an ensemble of scorers significantly outperforms a generator guided by a single scorer.

To summarize, our work has three main contributions.
• First, we propose a unified framework for composing pre-trained models across a variety of tasks, such as image generation, video question answering, mathematical reasoning, and robot manipulation.
• Second, we illustrate how the proposed framework can effectively solve zero-shot multimodal tasks without any training/finetuning. The closed-loop communication between the generator and scorers allows the models to interact with each other to improve performance iteratively.
• Finally, we illustrate how our framework enables the use of ensembles of different pre-trained models as scorers, significantly improving the zero-shot results by leveraging the strengths of multiple expert models.
These observations point to the effectiveness of the proposed method as a general purpose framework for composing pre-trained models for solving various zero-shot multimodal tasks.

3. METHOD

Given a set of large pre-trained models, we aim to utilize the expert knowledge from different models to solve zero-shot multimodal tasks. We separate pre-trained models into two categories: generators (G), such as GPT (Brown et al., 2020; Radford et al., 2019) and diffusion models (Ho et al., 2020), that can generate candidate solutions, and scorers (E), such as CLIP (Radford et al., 2021) and classifiers, that output a scalar score to evaluate each generated solution. We propose PIC (composing ensembles of Pre-trained models via Iterative Consensus), a framework which composes ensembles of pre-trained models for multimodal tasks. The core idea of PIC is to generate solutions through iterative optimization, where we leverage the knowledge from different models to jointly construct a consensus solution. In PIC, a generator G iteratively and sequentially generates candidate solutions, each of which is refined based on the feedback from a set of scorers. In particular, we seek to obtain a solution $x^*$ such that

$$x^* = \arg\min_{x \sim G} \sum_n E^n(x), \tag{1}$$

where $\{E^n\}$ is the set of scorers. At each iteration, we refine the solutions to have a lower score than in the previous iterations. This procedure, described in Equation (1), converges to a solution that minimizes the energy across multiple pre-trained models, which maximizes the agreement between the generator and scorers. In contrast to Socratic Models, where different pre-trained models are called sequentially, the closed-loop iterative refinement through which we obtain $x^*$ enables the generator and scorers to communicate with each other to reach a consensus on the final solution. Below, we illustrate how PIC can be broadly applied across tasks in image generation, video question answering, grade school math, and robot manipulation.
To optimize Equation (1), we consider two different optimization procedures: either a continuous approach that leverages the gradients of each scorer $E^n(x)$, or a discrete approach that directly samples possible solutions.
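As a rough illustration of these two optimization procedures, the sketch below runs both on a toy one-dimensional problem, treating scorer outputs as energies to be minimized as in Equation (1). The quadratic scorers, the step size `lam`, the proposal noise, and all function names are assumptions made for illustration; they are not the models or hyperparameters used in the paper.

```python
import random

# Hypothetical toy scorers: each returns an energy (lower is better).
def scorer_a(x):
    return (x - 1.0) ** 2

def scorer_b(x):
    return (x - 3.0) ** 2

SCORERS = [scorer_a, scorer_b]

def total_energy(x, scorers=SCORERS):
    """Summed energy across the ensemble of scorers."""
    return sum(e(x) for e in scorers)

def grad(f, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def refine_continuous(x0, steps=200, lam=0.05):
    """Continuous approach: step against the gradient of the summed energy."""
    x = x0
    for _ in range(steps):
        x = x - lam * grad(total_energy, x)
    return x

def refine_discrete(x0, steps=200, n_samples=16, noise=0.5, seed=0):
    """Discrete approach: sample proposals and keep the lowest-energy one."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        # The current solution stays in the pool, so the energy never increases.
        proposals = [x] + [x + rng.gauss(0.0, noise) for _ in range(n_samples)]
        x = min(proposals, key=total_energy)
    return x

if __name__ == "__main__":
    print(refine_continuous(10.0))
    print(refine_discrete(10.0))
```

In this toy setting the two scorers disagree (one prefers 1, the other 3), and both procedures settle near the compromise that minimizes the summed energy.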

3.1. APPLICATIONS TO ZERO-SHOT TASKS

Published as a conference paper at ICLR 2023

Image generation. We first apply the proposed framework to image generation, generating images conditioned on a text description or a class label. We use the reverse diffusion process of GLIDE (Nichol et al., 2021), a text-guided diffusion model, as the generator to generate image proposals. At each step of the diffusion process (corresponding to a step of the iterative refinement), we use the gradient from an ensemble of scorers, such as CLIP (Radford et al., 2021), to guide and update the generated proposals. We repeat this procedure until the final step. As shown in Fig. 2 (b), the image $x^k$ generated at iteration $k$ is first sent to the diffusion model to generate an image proposal $\hat{x}^{k+1}$. Each scorer outputs a score to evaluate whether the generated image matches the given text input. For example, CLIP computes the cosine similarity between the image and text features as the score. The scores generated by different scorers are summed, and their gradient with respect to $x^k$ is used to compute the next reverse prediction $x^{k+1}$:

$$x^{k+1} \leftarrow \hat{x}^{k+1} + \lambda \nabla_{x^k} \sum_{n=1}^{N} E^n_\theta(x^k, c), \tag{2}$$

where $N$ is the number of scorers and $c$ is the text label. We denote the reverse process prediction as $x^{k+1}$ instead of $x^{k-1}$ (used by most diffusion models) to keep the notation consistent across tasks.

Video question answering (VQA). We first use PIC to generate video frame captions. We then use GPT-3 to summarize the captions and answer questions about this video. Caption generation for a single video frame is shown in Fig. 2 (c). We use GPT-2 as the generator and multiple different CLIP models, trained with different configurations, as the scorers. Given a video frame $I$, we generate a sequence of words to describe it. To integrate feedback from scorers to the generator, similar to Tewel et al. (2021), we define a context cache $C_t$ (a set of embedding functions in GPT-2) that stores the context information generated so far, which is updated iteratively based on the feedback from scorers.
The prediction of the next word from the generator $G$ is given by $x_{t+1} = G(x_t, C_t)$. The context cache is updated using the gradient of the summed scorer losses:

$$C_t^{k+1} \leftarrow C_t^{k} + \lambda \nabla_{C_t^{k}} \sum_{n=1}^{N} \mathcal{L}_{\text{CLIP}}\big(E^n_\theta(x_1, x_2, \cdots, \hat{x}_{t+1}, I)\big), \tag{3}$$

where $k$ is the step of iterative refinement. After several iterations, the updated $C_t$ is used to generate the next token $x_{t+1} = G(x_t, C_t)$. We repeat this process until we generate the entire caption. We cascade the captions of multiple video frames and questions about this video to prompt GPT-3 for video question answering (see Appendix B.2).

Grade school math. We further apply PIC to solve grade school math problems. We use GPT-2 as the generator and treat the grade school math problem as a text generation problem. The scorer, a pre-trained question-solution classifier, provides the generator with feedback to guide the generation of the next token $x_{t+1}$. We follow the approach used in VQA to iteratively optimize the generations based on the feedback from scorers. Our generator $G$ first generates a set of candidate words $\hat{X}_{t+1} = \{\hat{x}_{t+1}\}$, and then the classifier predicts the probability of each solution (the concatenation of previous words and each new word, $\{x_1, x_2, \cdots, \hat{x}_{t+1}\}$, where $\hat{x}_{t+1} \in \hat{X}_{t+1}$) matching the given question. The classifier score is the cross-entropy loss between this new probability distribution and the original distribution of the next word obtained from the generator $G$. The gradient of the classifier score is used to update $C_t$ through iterative refinement, in the same way as Eq. (3). The updated $C_t$ is used to predict the next word $x_{t+1} = G(x_t, C_t)$. We repeat this process until we generate the complete solution.

Robot manipulation. Finally, we illustrate how PIC can be applied to manipulate objects in the robot environment to conform to a set of object relations, such as "red bowl on top of blue mug", shown in Fig. 2 (d). We use the combination of Model Predictive Control (MPC) (Williams et al., 2015) and the World Model as the generator.
At each time step, we first use MPC to sample a set of possible actions and then render the state images (after executing an action) from multiple camera views using the world model. For each action, the scorer computes a summed score across all camera views as its final score, which is used to select the best action to execute. Thus, in this domain, the ensemble consists of scorers based on different views of the scene. For the generator, we assume that there is a pre-trained model, i.e. a world model, that can accurately render and simulate the dynamic changes in the robot world. Since such a large pre-trained model does not directly exist, we approximate it using an environment simulator combined with MPC as the generator. For the scorer, we use the pre-trained ViLD (Gu et al., 2021) to generate segmentation maps for images captured by different camera views $n$, and the corresponding text label for each segment, which are used to obtain object relations. We compare the generated object relations with the relations specified by the text description to obtain the score: the score equals 0 if they match and 1 otherwise (here the score is a distance; see Appendix B.4 for details). To obtain a final world state $x_T$ that satisfies the specified relations, and the action sequence $\{a_1, \cdots, a_T\}$ that manipulates the objects into the final state $x_T$, the generator iteratively samples possible actions $\hat{a}^k_{t+1}$ and gets feedback from scorers. The best action is selected as:

$$a_{t+1} = \arg\min_{\hat{a}^k_{t+1}} \sum_{n=1}^{N} E^n_\theta(x_t, \hat{a}^k_{t+1}). \tag{4}$$

Each scorer, $E^n_\theta$, outputs a score for the resultant state obtained when a candidate action $\hat{a}^k_{t+1}$ is applied to the current world state $x_t$. We execute $a_{t+1}$ in the environment and get a new state $x_{t+1}$. We repeat this process until the task is accomplished or we reach the final step $T$.
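The action-selection rule above can be sketched on a toy tabletop as follows. This is only an illustration under stated assumptions: `apply_action` stands in for the world model, `make_view_scorer` stands in for a per-camera scorer (the paper uses ViLD on rendered views, which is not reproduced here), and the noise level, the uniform action sampler, and `n_samples` are hypothetical choices.

```python
import random

def apply_action(state, action):
    """Toy world model: dropping an object moves it to the target 2D location."""
    obj, loc = action
    new_state = dict(state)
    new_state[obj] = loc
    return new_state

def make_view_scorer(noise, seed):
    """Hypothetical per-camera scorer: 0 if the relation appears satisfied, else 1."""
    def scorer(state, relation):
        rng = random.Random(seed)  # fixed observation noise per view
        a, rel, b = relation
        xa = state[a][0] + rng.gauss(0.0, noise)
        xb = state[b][0] + rng.gauss(0.0, noise)
        if rel == "left of":
            return 0 if xa < xb else 1
        raise ValueError(f"unsupported relation: {rel}")
    return scorer

def select_action(state, relation, scorers, n_samples=200, seed=0):
    """Sample candidate actions and return the one with the lowest summed score."""
    rng = random.Random(seed)
    objects = sorted(state)
    candidates = [
        (rng.choice(objects), (rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)))
        for _ in range(n_samples)
    ]

    def summed_score(action):
        next_state = apply_action(state, action)
        return sum(s(next_state, relation) for s in scorers)

    return min(candidates, key=summed_score)

state = {"blue mug": (0.8, 0.5), "purple bowl": (0.3, 0.5)}
relation = ("blue mug", "left of", "purple bowl")
scorers = [make_view_scorer(noise=0.02, seed=n) for n in range(3)]
action = select_action(state, relation, scorers)
final_state = apply_action(state, action)
```

Summing the 0/1 distances over views means an action is only preferred when most views agree that the relation holds, which is the sense in which the views act as an ensemble.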

4. EXPERIMENT SETUP

We evaluate the proposed framework for composing pre-trained models on four representative tasks: image generation, video question answering, grade school math, and robot manipulation.

Image generation. We first show that composing the pre-trained image generator and scorer models such as CLIP enables effective zero-shot image generation. We evaluate the image generation results on ImageNet (Deng et al., 2009) at an image resolution of 64 × 64. The class labels are used as the text input to guide image generation. Each method generates 50 images for each class. We evaluate the image generation quality using Inception Score (IS) (Salimans et al., 2016), Fréchet Inception Distance (FID) (Heusel et al., 2017), and Kernel Inception Distance (KID) (Bińkowski et al., 2018). IS measures the distribution of generated images; higher values mean the models can generate more distinct images. FID considers the distributions of both generated images and real images; lower scores indicate that the generated images are closer to the real images. KID is similar to FID, measuring the similarity between two data distributions, but in kernel space.

Video question answering. We evaluate methods for solving VQA tasks on ActivityNet-QA (Yu et al., 2019). Our method generates free-form language answers instead of selecting an answer from a pre-defined answer set (Yang et al., 2021; Lei et al., 2022). To evaluate such free-form VQA, we ask workers from Amazon Mechanical Turk to judge whether the generated answer matches the given question and video (see Appendix C for IRB approval and experimental details). For fair comparison, all the approaches answer the same 300 video questions, and each answer is evaluated by three different workers. The accuracy rate and vocabulary size are reported. An answer is correct if at least two workers believe it is correct. The accuracy rate is the percentage of correctly answered questions over all the questions.
To evaluate the diversity of generated answers, we also report the vocabulary size (i.e. the number of words) of answers generated by each method.

Grade school math. GSM8K (Cobbe et al., 2021) is a dataset of grade school math problems. Each problem consists of a question, intermediate analyses, and a final solution. We evaluate approaches on the 1K test set. We use beam search to generate candidate solutions and report accuracy for beam sizes 1 and 5. For a beam size of 1, we mark the result as correct if it matches the final solution. For a beam size of 5, we mark the result as correct if any of the five generated results matches the solution.

Robot manipulation. We next evaluate how pre-trained models may be used to manipulate objects in Ravens (Zeng et al., 2020). In Ravens, the robot's action space is to drop an object at a 2D location on the table. The goal is to obtain a scene configuration that satisfies the object relations specified by a textual description or a real-world image, such as "blue mug to the left of purple bowl". The task is successful if the object relations in the final state satisfy all the relations specified by the input text or image. We report the success rate of tasks with two and three specified object relations.
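The scoring rules described in this section can be written down compactly. The sketch below is a minimal illustration, assuming worker votes are stored as three booleans per answer and that words are separated by whitespace and compared case-insensitively; the function names and the tokenization are assumptions, not the evaluation code used for the reported numbers.

```python
def majority_correct(votes):
    """An answer counts as correct if at least two of three workers accept it."""
    if len(votes) != 3:
        raise ValueError("expected exactly three worker votes")
    return sum(bool(v) for v in votes) >= 2

def vqa_accuracy(all_votes):
    """Percentage of questions whose answer is accepted by majority vote."""
    correct = sum(majority_correct(v) for v in all_votes)
    return 100.0 * correct / len(all_votes)

def vocabulary_size(answers):
    """Number of distinct lower-cased, whitespace-separated words (assumed tokenization)."""
    return len({word for answer in answers for word in answer.lower().split()})

def beam_accuracy(beams, references):
    """A problem is solved if any candidate in its beam matches the reference."""
    solved = sum(ref in beam for beam, ref in zip(beams, references))
    return 100.0 * solved / len(references)
```

With a beam size of 1 each inner list in `beams` holds a single candidate, so `beam_accuracy` reduces to exact-match accuracy.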

5. EXPERIMENTS

We compare the proposed method with baselines on the above four zero-shot tasks.

5.1. IMAGE GENERATION

We evaluate zero-shot conditional image generation on ImageNet in Table 1. We first show results of composing a single generator (G) and a single scorer (E). We compose GLIDE (Nichol et al., 2021) with three different types of scorers. E1 is CLIP (Radford et al., 2021), which computes the cosine similarity between the image and text features as the score. E2 is the image classifier (CLS) (Dhariwal & Nichol, 2021), which predicts the probability of the image matching the text label as the score. E3 is the classifier-free guidance (CLS-FREE) (Ho & Salimans, 2022), which can be treated as an implicit classifier that directly provides pixel-wise gradient feedback to the generated image (Appendix B.1). We then compose the generator with all scorers, i.e. G+E1+E2+E3. Composing the generator and a single scorer allows zero-shot image generation. Composing multiple scorers significantly outperforms a single scorer. We note that the generator is not trained on ImageNet; thus the results in Table 1 cannot be directly compared with methods trained on ImageNet.

5.2. VIDEO QUESTION ANSWERING

Quantitative results. We compare PIC with one of the state-of-the-art VQA approaches, JustAsk (Yang et al., 2021), on ActivityNet-QA (Yu et al., 2019). In Table 2, JustAsk (FT) is finetuned on ActivityNet-QA and thus achieves the best results. We then compare PIC with JustAsk (Pretrain) for zero-shot VQA. The generator of our method, GPT-2 (medium size), is trained on Webtext (Radford et al., 2019) using the Huggingface library (Wolf et al., 2019). Our scorers are CLIP models (Radford et al., 2021; Reimers & Gurevych, 2019) trained on different datasets or using different configurations. PIC (G+E1) outperforms JustAsk (Pretrain) by 7.72%. Composing more scorers further improves the accuracy by 2.78%. In addition, the vocabulary size of answers generated by our method is larger than that of other approaches, indicating that our method can answer questions using richer language and more diverse phrasing. Note that our method solves a more challenging problem than JustAsk (Pretrain) and JustAsk (FT): our method generates open-language answers, while JustAsk (Pretrain) and JustAsk (FT) select an answer from a pre-defined answer set. Generating free-form responses requires both semantic and grammatical correctness. PIC performs well on both these dimensions while also using a richer vocabulary.

Qualitative results. In Fig. 3, we show answers generated by different approaches given a video (only a single video frame is shown) and questions. Our approach successfully identifies gender and clothing, but none of the approaches is able to count.

5.3. GRADE SCHOOL MATH

Quantitative results. In Table 3, we compare PIC with two baselines, GPT-Pretrain and GPT-FT, for solving math problems on GSM8K (Cobbe et al., 2021). GPT-Pretrain uses the pre-trained GPT-2 (medium size GPT-2 trained on Webtext using Huggingface) to generate numeric strings. GPT-FT is based on GPT-Pretrain and then finetuned on GSM8K. Our method uses the same GPT-2 (Pretrain) as the generator and a question-solution classifier (CLS) as the scorer. The classifier is trained on GSM8K to distinguish whether a solution is correct for a given question. Surprisingly, we find that PIC achieves significantly better performance than GPT-FT (13.344% higher on beam size 1), even though the generator has never seen the math problems before. The classifier only provides feedback to the generator, but through iterative refinement, combining a generator and a scorer without joint training is more effective than directly finetuning GPT-2 on GSM8K (we observe overfitting when finetuning GPT-2 on GSM8K).

5.4. ROBOT MANIPULATION

Quantitative results. We evaluate the proposed method on manipulating objects to achieve object relations specified by textual descriptions (Text) or real-world images (Image). In Table 4, we find that using scorers of multiple camera views substantially improves the accuracy in both settings.

Qualitative results. Figure 4 shows example results of the proposed method manipulating objects to accomplish the given task. Our method enables zero-shot robot manipulation on objects with different sizes, colors, and shapes given either the language goal or image goal.

6. ANALYSIS

PIC exhibits effective zero-shot generalization ability on a variety of tasks. To further understand the source of such generalization, we investigate two key components in PIC, i.e. the composition of multiple scorers (consensus optimization) (Section 6.1) and the iterative refinement (Section 6.2).

6.1. EFFECT OF CONSENSUS OPTIMIZATION

We have shown that composing multiple scorers contributes to zero-shot generalization. We further explore the influence of gradually adding each new scorer on the zero-shot performance.

Image generation. In Table 5, we first show results of composing GLIDE and the CLIP scorer. We then gradually add a new scorer, the image classifier or classifier-free guidance, each time. Finally, we report the results of composing the generator and all scorers. The performance improves every time we add a new scorer, indicating that composing multiple scorers improves zero-shot performance.

Table 6:
Method | Generator | Scorer | Feedback steps | Accuracy ↑
GPT-Pretrain+E | GPT-2 (Medium) (Pretrain) | CLS | t = T | 9.704
GPT-FT+E | GPT-2 (Medium) (FT) | CLS | t = T | 14.481
PIC (G+E) | GPT-2 (Medium) (Pretrain) | CLS | t = {1, ..., T} | 17.210

Robot manipulation. In Table 7, we analyze the effect of composing multiple scores on robot manipulation. The goal is specified by textual descriptions. Composing scores from multiple views, PIC (G+$\sum_{n=1}^{3} E^n$) and PIC (G+$\sum_{n=1}^{5} E^n$), leads to higher accuracy.

6.2. EFFECT OF ITERATIVE REFINEMENT

Next, we explore the influence of iterative refinement on zero-shot generalization, i.e. the feedback loop between the generator and scorers. We compare PIC with baselines that compose the generator and scorers, but with the scorers only providing feedback to the generator at the end.

Grade school math. In Table 6, the baselines, GPT-Pretrain+E and GPT-FT+E, generate five proposal solutions for a given math problem. Then the scorer, i.e. the same question-solution classifier used in PIC, selects the best solution based on its score. PIC iteratively refines the answer as it is generated, while the baselines only score the fully generated solutions at the end. PIC and GPT-Pretrain+E use the same generator and scorer, but PIC outperforms GPT-Pretrain+E by 7.507%. PIC still achieves better performance than GPT-FT+E, which uses a stronger generator (finetuned on the GSM8K dataset).

Robot manipulation. In Table 7, the baseline, No-IR (G+$\sum_{n=1}^{5} E^n$), first samples 100 trajectories without using the feedback from scorers. Then the scorers select the best trajectories based on the summed score. The generator and scorers of this baseline are the same as in our method, i.e. PIC (G+$\sum_{n=1}^{5} E^n$), but our method outperforms the baseline by 37.5% on the "2 Relations" setting, indicating the effectiveness of iterative refinement in the proposed framework.

Together, these results show that the composition of multiple scorers and iterative refinement are both important for zero-shot generalization. These results point to the potential broader applicability of the proposed method as a general purpose framework for zero-shot multimodal tasks.
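The structural difference between the two settings compared here (scorer feedback only at t = T versus at every step t = 1, ..., T) can be sketched on a toy sequence task. Everything in this sketch is an assumption for illustration: the digit "generator", the mismatch-counting scorer, and the greedy per-token selection, which stands in for the gradient-based context-cache update of the actual method. It does not reproduce the reported numbers.

```python
import random

TARGET = [3, 1, 4, 1, 5]  # hypothetical "correct" sequence known only to the scorer

def scorer(prefix):
    """Toy energy: number of positions where the prefix disagrees with TARGET."""
    return sum(a != b for a, b in zip(prefix, TARGET))

def sample_token(rng):
    """Toy generator: proposes a digit uniformly at random."""
    return rng.randrange(10)

def rerank_at_end(n_proposals=5, seed=0):
    """Feedback at t = T: sample complete sequences, then pick the best one."""
    rng = random.Random(seed)
    proposals = [[sample_token(rng) for _ in TARGET] for _ in range(n_proposals)]
    return min(proposals, key=scorer)

def refine_every_step(n_candidates=5, seed=0):
    """Feedback at t = 1..T: score candidate next tokens before committing to one."""
    rng = random.Random(seed)
    prefix = []
    for _ in TARGET:
        candidates = [sample_token(rng) for _ in range(n_candidates)]
        prefix.append(min(candidates, key=lambda tok: scorer(prefix + [tok])))
    return prefix
```

Both functions draw the same number of generator samples (25 tokens with the defaults), so the only difference is when the scorer is consulted.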

7. CONCLUSION AND FUTURE WORK

In this paper, we propose a unified framework for composing ensembles of pre-trained models through iterative consensus without any training or finetuning. Our framework consists of a generator and an ensemble of scorers. The scorers provide feedback to the generator to iteratively improve its generated results. We show the proposed method allows effective zero-shot generalization on four representative tasks, i.e. image generation, video question answering, grade school math, and robot manipulation, and even outperforms methods that directly finetune models on certain tasks. We further analyze the source of such zero-shot generalization by exploring the effect of the composition of multiple scorers and the iterative refinement, and find that both are important for zero-shot generalization.

As our method does not need any training or finetuning, one drawback is that its performance depends on the pre-trained models. Training large models is complementary to the framework and methods we propose, and such models may be directly applied within it. We hope to explore these directions for zero-shot generalization in future work. In addition, our framework enables the composition of separately trained models and boosts performance by leveraging the knowledge from multiple expert models. The scorers can be learned at different times on different data in an incremental-learning manner, enabling the combination of incrementally learned knowledge. Our framework thus paves the way for many potential applications in lifelong learning / continual learning settings.

Appendix

In this appendix, we first show additional results in Appendix A. We then show experimental details of each task in Appendix B. The ethics statement of the Amazon Mechanical Turk experiment for video question answering is in Appendix C.

A.1 IMAGE GENERATION RESULTS

We show more image generation results using different scorer models in Fig. A1. We find that most of the time, the proposed framework, either using a single scorer model or composing multiple scorer models, works well, as shown in the first four examples in the left column of Fig. A1. In some hard cases, some scorer models might fail (the remaining examples in Fig. A1). However, there is no discernible trend as to which scorer is better on which tasks. The results of composing multiple scorer models are significantly better than using a single one, as different scorer models capture different aspects of the information. For example, in the fifth example in the left column of Fig. A1, PIC with classifier-free guidance (CLS-FREE) and PIC with a pre-trained classifier (CLS) cannot generate an image with "tench", but PIC with the pre-trained CLIP (CLIP) can generate the correct result and the composed model (CLS-FREE + CLS + CLIP) also works. This is why we consider composing multiple scorers: to leverage the strength of each expert model and improve the worst case.

A.2 IMAGE GENERATION WITH DIFFERENT GENERATOR

Our framework can be applied to other generators as well. For example, we changed the generator from GLIDE to Stable Diffusion (Rombach et al., 2021). The results of composing Stable Diffusion and classifier-free guidance (CLS-FREE) are shown in Table A1. Using a more powerful pre-trained model can further boost the performance.

A.3 GRADE SCHOOL MATH QUALITATIVE RESULTS

Example results of different methods are shown in Fig. A2. Our method can solve math problems involving addition, subtraction, multiplication, and division, even for solutions with three-digit numbers. In contrast, GPT-FT often fails to understand math problems.

A.4 COMPOSING SCORER MODELS

The key idea of our method is to compose ensembles of pre-trained models, and the way to combine them can vary. We add two additional experiments in this appendix: (1) using only the best scorer (the scorer that provides the highest score) and (2) using a weighted sum of the scores. Our method composes pre-trained models without training or finetuning, so we did not learn separate weights for different models. Instead, we add an experiment that uses the scores (after softmax) of each scorer model as their weights and then composes the scorers using the weighted summed score. We compare these two baselines with our method, which uses the summation of all scores. As shown in Table A2, using a summed score generates the best results.

A.5 ADDITIONAL BASELINE COMPARISONS

Image Generation. In Table A3, we compare our approach with a generative model specifically trained on ImageNet, Efficient-VDVAE (Hazami et al., 2022). Efficient-VDVAE is an unconditional hierarchical VAE model. As illustrated in Table A3, although Efficient-VDVAE is explicitly trained on ImageNet, it performs much worse than our approach.

Robot Manipulation. We compare our robot manipulation method with the robot manipulation model, CLIPort (Shridhar et al., 2022), used in the Socratic Models paper (Zeng et al., 2022). In Socratic models, the evaluation of robot manipulation consists of two steps: 1) a GPT model translates a text goal into a set of subgoals and 2) a CLIPort model executes each subgoal. In contrast, in our robot manipulation task, 1) a vision-language model is used to translate an image goal into a set of subgoals and 2) an iterative MPC procedure is used to execute each given subgoal. A fair comparison is therefore between the iterative MPC procedure used in our approach and the CLIPort model used in Socratic models. We thus compare our method to the multilingual CLIPort model on the performance of executing a single relation subgoal.
The CLIPort model obtains a success rate of 22.6% in this setting, while our approach obtains a success rate of 76.3%.

B.1 IMAGE GENERATION

We use the reverse diffusion process of GLIDE, a text-guided diffusion model, as the generator to generate image proposals. At each step of the diffusion process (corresponding to a step of the iterative refinement), we use the gradient from an ensemble of scorers to guide and update the generated proposals. We repeat this procedure until the final step. As shown in Fig. A3, the image $x^k$ generated at iteration $k$ is first sent to the diffusion model to generate an image proposal $\hat{x}^{k+1}$. The scorers provide feedback to refine the generated result. The CLIP model computes the cosine similarity between the image and text features as the score (we use the pre-trained CLIP model from Ho & Salimans (2022)). The image classifier (Dhariwal & Nichol, 2021) predicts the probability of the image matching the text label as the score. The scores generated by different scorers are summed, and their gradient with respect to $x^k$ is used to compute the next reverse prediction $x^{k+1}$. The classifier-free guidance (Ho & Salimans, 2022) can be treated as an implicit classifier that directly provides pixel-wise gradient feedback to the generated image. Our framework enables the use of ensembles of different pre-trained models as scorers, significantly improving the zero-shot results by leveraging the strengths of multiple expert models. Our implementation for image generation is modified from the code of GLIDE (Nichol et al., 2021) and classifier-guided diffusion (Dhariwal & Nichol, 2021). We use DDIM to sample images from GLIDE in 100 steps. The guidance scale is set to 3.
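The update described above (a generator proposal plus the gradient of the summed scores) can be illustrated with a small numerical sketch. This is not the GLIDE or DDIM implementation: `toy_reverse_step` is a hypothetical stand-in for one reverse diffusion step, each scorer is assumed to be a negative squared distance to a fixed target so that its gradient is available in closed form, and `lam` is a toy step size rather than the guidance scale reported above.

```python
def summed_score_gradient(x, targets):
    """Gradient of sum_n score_n(x), with score_n(x) = -0.5 * ||x - target_n||^2."""
    return [sum(t[i] - x[i] for t in targets) for i in range(len(x))]

def toy_reverse_step(x):
    """Hypothetical stand-in for one reverse diffusion step of the generator."""
    return [0.9 * v for v in x]

def guided_sampling(x0, targets, num_steps=100, lam=0.1):
    """Repeat: proposal from the generator, then add lam * gradient of summed scores."""
    x = list(x0)
    for _ in range(num_steps):
        proposal = toy_reverse_step(x)            # generator proposal
        grad = summed_score_gradient(x, targets)  # gradient taken at the current x
        x = [p + lam * g for p, g in zip(proposal, grad)]
    return x

# Two toy scorers that prefer different points; the sample settles on a compromise.
sample = guided_sampling([5.0, -5.0], targets=[[1.0, 0.0], [0.0, 1.0]])
```

Because the scores here are "higher is better", the gradient is added, matching the sign of the update in the text; with energies to be minimized the sign would flip.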

B.2 VIDEO QUESTION ANSWERING

In video question answering, we use the proposed method to generate captions for the video frames and then use GPT-3 to summarize the captions to answer questions. We use GPT-2 as the generator and a set of CLIP models as scorers to generate captions for each video frame. The CLIP models (Radford et al., 2021; Reimers & Gurevych, 2019) are from the Huggingface library (Wolf et al., 2019):
• CLIP-32: https://huggingface.co/openai/clip-vit-base-patch32.
• CLIP-14: https://huggingface.co/openai/clip-vit-large-patch14.
• CLIP-multilingual: https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1.

Fig. A4 shows the framework for generating frame captions. Given a video frame $I$, we generate a sequence of words to describe it. To integrate feedback from the scorers into the generator, similar to ZeroCap (Tewel et al., 2021), we define a context cache $C_t$ (a set of embedding functions in GPT-2, such as the embedding functions $K$, $Q$, $V$ in the Transformer blocks) that stores the context information generated so far and is updated iteratively based on the feedback from the scorers. The prediction of the next word from the generator $G$ is given by $x_{t+1} = G(x_t, C_t)$, where $G$ is the pre-trained language model. ZeroCap uses the following loss function to optimize $C_t$:

$$\arg\min_{C_t} \; L_{\text{CLIP}}\big(G(x_t, C_t), I\big) + \lambda L_{\text{CE}}\big(G(x_t, C_t), \hat{x}_{t+1}\big), \tag{A1}$$

where $I$ is the feature of a video frame and $\hat{x}_{t+1}$ is the next word predicted by the original language model. The CLIP loss $L_{\text{CLIP}}$ optimizes $C_t$ so that the newly generated sentence describes the video frame.
The second loss $L_{\text{CE}}$ ensures that the newly generated sentence stays close to the sentence generated by the original language model. Our implementation is based on the code of ZeroCap (Tewel et al., 2021). The context cache $C_t$ is updated using:

$$C_t \leftarrow C_t + \alpha \frac{\nabla_{C_t}\, p(x_{t+1} \mid C_t)}{\lVert \nabla_{C_t}\, p(x_{t+1} \mid C_t) \rVert^{2}}, \tag{A2}$$

where $p(x_{t+1} \mid C_t)$ is the probability of predicting word $x_{t+1}$ given $C_t$. Eq. (A1) can be optimized by gradient descent using the update in Eq. (A2). In our experiments, we use 5 steps of gradient descent. The learning rate $\alpha$ is set to 0.3.

In the video question answering tasks, we compose multiple CLIP scores and use their composed score to optimize $C_t$:

$$\begin{aligned}
\arg\min_{C_t} \; & L_{\text{CLIP-32}}\big(G(x_t, C_t), I\big) + L_{\text{CLIP-14}}\big(G(x_t, C_t), I\big) && \text{(A3)} \\
& + L_{\text{CLIP-multilingual}}\big(G(x_t, C_t), I\big) + \lambda L_{\text{CE}}\big(G(x_t, C_t), \hat{x}_{t+1}\big). && \text{(A4)}
\end{aligned}$$

After several iterations, the updated $C_t$ is used to generate the next token $x_{t+1} = G(x_t, C_t)$. We repeat this process until we generate the entire caption.

Figure A4: Overview of video frame captioning for video question answering. We use GPT-2 as the generator and a set of CLIP models as scorers to generate captions for each video frame. To integrate feedback from the scorers into the generator, similar to ZeroCap (Tewel et al., 2021), we define a context cache $C_t$ (a set of embedding functions in GPT-2) that stores the context information generated so far and is updated iteratively based on the feedback from the scorers. To update $C_t$, we first use $G$ to generate a set of candidate words $\hat{X}_{t+1} = \{\hat{x}_{t+1}\}$, and then use the feature distance (after softmax) between each sentence (the concatenation of previous words and each new word $\{x_1, x_2, \cdots, \hat{x}_{t+1}\}$, where $\hat{x}_{t+1} \in \hat{X}_{t+1}$) and the video frame as the probability of them matching. The CLIP score is the cross-entropy loss $L_{\text{CLIP}}$ between this new probability distribution and the original distribution of the next word obtained from the generator $G$ (see Equation 4 in Tewel et al. (2021)). The gradient of the summed scores (multiple CLIP models) is propagated to $G$ to update $C_t$ (see Equation 5 in Tewel et al. (2021)). After several iterations, the updated $C_t$ is used to generate the next token $x_{t+1} = G(x_t, C_t)$. We repeat this process until we generate the entire caption. We concatenate the captions of multiple video frames and the questions about this video to prompt GPT-3 for video question answering.

To answer the video questions, we concatenate the generated captions of the video frames and the questions about this video to prompt GPT-3 to generate answers. For each video, we delete the first 10 frames and the last 10 frames to remove the beginning or ending advertisements.
We then sample 30 video frames evenly from the remaining frames and send their captions to GPT-3. To guide GPT-3 to generate proper answers, we randomly select 30 question-answer pairs from the training set of ActivityNet-QA (Yu et al., 2019) and use them as part of the GPT-3 prompt. As shown in Fig. A5, the GPT-3 prompt consists of example question-answer pairs, the video frame captions generated by the proposed method, and the question about this video that needs to be answered. The text generated by GPT-3 is used as the answer to the question. We also use the profanity check tool (https://github.com/vzhou842/profanity-check) to remove improper answers.
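The frame selection and prompt assembly described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `select_frames` and `build_prompt`, the `Q:`/`A:`/`Captions:` layout, and the handling of very short videos are all assumptions, since the exact prompt format is only shown in Fig. A5.

```python
from typing import List, Sequence, Tuple


def select_frames(num_frames: int, trim: int = 10, keep: int = 30) -> List[int]:
    """Drop the first and last `trim` frames, then pick `keep` evenly spaced indices."""
    start, stop = trim, num_frames - trim
    usable = stop - start
    if usable <= 0:
        return []  # assumption: videos that are too short yield no frames
    if usable <= keep:
        return list(range(start, stop))
    step = usable / keep
    return [start + int(i * step) for i in range(keep)]


def build_prompt(examples: Sequence[Tuple[str, str]],
                 captions: Sequence[str],
                 question: str) -> str:
    """Assemble few-shot QA pairs, frame captions, and the target question.

    The textual layout is hypothetical; only the three ingredients come
    from the description above.
    """
    lines: List[str] = []
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append("Captions: " + " ".join(captions))
    lines.append(f"Q: {question}")
    lines.append("A:")  # the language model continues from here
    return "\n".join(lines)


frame_ids = select_frames(320)
prompt = build_prompt(
    examples=[("is the person indoors?", "yes")],
    captions=["A human is making pasta.", "A person stirs a pot."],
    question="what is the person doing?",
)
print(frame_ids[:3], len(frame_ids))
print(prompt)
```

In practice each selected frame index would first be captioned by the generator and scorers, and the resulting captions would be passed to `build_prompt`.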

B.3 GRADE SCHOOL MATH

We treat the grade school math problem as a text generation problem. As shown in Fig. A6, we use GPT-2 as the generator and a pre-trained question-solution classifier as the scorer. The pre-trained classifier is a binary classifier trained on the training set of GSM8K (Cobbe et al., 2021). Its input is a math problem, such as "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?", and an answer, such as "72". If the answer is correct for the given problem, the label is 1; otherwise, the label is 0. After training, the classifier is used as the scorer to provide feedback to the generator to guide the generation of the next token $x_{t+1}$. Similar to VQA, the generator $G$ first generates a set of candidate words $\hat{X}_{t+1} = \{\hat{x}_{t+1}\}$, and then the classifier predicts the probability of each solution (the concatenation of previous words and each new word $\{x_1, x_2, \cdots, \hat{x}_{t+1}\}$, where $\hat{x}_{t+1} \in \hat{X}_{t+1}$) matching the given question. The classifier score is the cross-entropy loss between this new probability distribution and the original distribution of the next word obtained from the generator $G$ (the classifier score is computed in the same way as the CLIP score in VQA). We also use the cross-entropy loss $L_{\text{CE}}$ in Equation 2 of ZeroCap (Tewel et al., 2021) to ensure the generated sentence is grammatically sound. The context cache $C_t$ is updated in the same way as in the video question answering task, but we use the classifier score when providing feedback to $C_t$. The updated $C_t$ is used to predict the next word $x_{t+1} = G(x_t, C_t)$. We repeat this process until we generate the complete solution. As in the video question answering task, we use 5 steps of gradient descent. The learning rate $\alpha$ is set to 0.3.
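To make the classifier score concrete, the sketch below computes a cross-entropy between a scorer-induced distribution over candidate words and the generator's distribution over the same candidates. It is an illustration under stated assumptions: the candidate words, the raw scorer outputs, and the generator probabilities are made-up numbers, the softmax over scorer outputs follows the "after softmax" convention mentioned for the CLIP score, and renormalizing the generator probabilities over the candidate set is an assumption of this sketch.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def scorer_cross_entropy(scorer_outputs: Sequence[float],
                         generator_probs: Sequence[float]) -> float:
    """Cross-entropy between the scorer's target distribution and the generator's.

    scorer_outputs: one raw score per candidate word (hypothetical values).
    generator_probs: the generator's probability for each candidate word,
    renormalized here over the candidate set (an assumption).
    """
    target = softmax(scorer_outputs)
    norm = sum(generator_probs)
    return -sum(t * math.log(p / norm) for t, p in zip(target, generator_probs))


candidates = ["72", "48", "24"]   # hypothetical candidate next words
scorer_outputs = [4.0, 0.5, 0.1]  # the scorer strongly prefers "72"

# Generator that disagrees with the scorer vs. one that agrees with it.
loss_misaligned = scorer_cross_entropy(scorer_outputs, [0.2, 0.5, 0.3])
loss_aligned = scorer_cross_entropy(scorer_outputs, [0.8, 0.1, 0.1])
print(loss_misaligned, loss_aligned)
```

The loss is lower when the generator already places most of its probability on the candidate favored by the scorer, so minimizing it with respect to the context cache pushes the generator toward solutions the classifier judges correct.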

B.4 ROBOT MANIPULATION

In robot manipulation, we use the proposed method to manipulate objects in Ravens (Zeng et al., 2020) to conform to a set of object relations specified by text descriptions or real-world images. We use MPC+World Model as the generator and ViLD (Gu et al., 2021) as the scorer. As shown in Figure A7 , given a real-world image, our model manipulates objects in the environment to achieve a state with objects having the same object relations as the given image. We first use ViLD to generate a 2D segmentation of the real-world image and the corresponding text label, such as "mug", for each segment. We then use the relative pixel-wise offsets of segmentation masks and the text labels to infer a set of object relations (top panel of Figure A7 ). 
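The step of inferring object relations from the relative pixel offsets of labeled segmentation masks can be sketched as below. This is a hypothetical illustration only: the mask representation, the centroid-based comparison, the `min_offset` threshold, and the relation vocabulary ("left of", "right of", "above", "below") are assumptions of this sketch and are not taken from ViLD or from the paper's implementation.

```python
from typing import Dict, List, Tuple

Mask = List[Tuple[int, int]]  # (row, col) pixel coordinates of one segment


def centroid(mask: Mask) -> Tuple[float, float]:
    """Mean (row, col) position of a segmentation mask."""
    rows = sum(r for r, _ in mask) / len(mask)
    cols = sum(c for _, c in mask) / len(mask)
    return rows, cols


def infer_relations(segments: Dict[str, Mask], min_offset: float = 1.0) -> List[str]:
    """Derive pairwise spatial relations from centroid offsets.

    Image rows grow downward, so a smaller row index means "above".
    Only the dominant axis of the offset is reported for each pair.
    """
    names = sorted(segments)
    centers = {name: centroid(segments[name]) for name in names}
    relations: List[str] = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d_row = centers[a][0] - centers[b][0]
            d_col = centers[a][1] - centers[b][1]
            if abs(d_col) >= abs(d_row) and abs(d_col) >= min_offset:
                side = "left of" if d_col < 0 else "right of"
                relations.append(f"{a} {side} {b}")
            elif abs(d_row) >= min_offset:
                side = "above" if d_row < 0 else "below"
                relations.append(f"{a} {side} {b}")
    return relations


# Two labeled segments on the same rows; the bowl lies to the right of the mug.
segments = {
    "mug": [(5, 2), (5, 3), (6, 2), (6, 3)],
    "bowl": [(5, 10), (5, 11), (6, 10), (6, 11)],
}
relations = infer_relations(segments)
print(relations)
```

A relation set produced this way from the goal image could then be compared against the relations observed in the simulated environment to score candidate actions.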

A.4 ROBOT MANIPULATION

In robot manipulation, we use the proposed method to manipulate objects in Ravens (Zeng et al., 2020) to conform to a set of object relations specified by text descriptions or real-world images. We use MPC+World model as the generator and the ViLD (Gu et al., 2021) as the scorer. As shown in Figure 9 , given a real-world image, our model manipulates objects in the environment to achieve a state with objects having the same object relations as the given image. We first use ViLD to generate a 2D segmentation of the real-world image and the corresponding text label, such as "mug", for each segment. We then use the relative pixel-wise offsets of segmentation masks and the text labels to infer a set of object relations (top panel of Figure 9 ). cascade the captions of multiple video frames and questions about this video to prom video question answering. Grade school math. We further apply PIC to solve grade school math problems. We u the generator and treat the grade school math problem as a text generation problem. T pre-trained question-solution classifier, provides the generator feedback to guide the generation xt+1. We follow the approach used in VQA to iteratively optimize the gener on the feedback from scorers. Our generator G first generates a set of candidate words then the classifier predicts the probability of each solution (the concatenation of pre and each new word {x1, x2, • • • , xi t+1 }) matching the given question. The classifier cross-entropy loss between this new probability distribution and the original distributio word obtained from the generator G. The gradient of the classifier score is used to updat iterative refinement. The updated Ct is used to predict the next word xt+1 = G(xt, Ct this process until we generate the complete solution. Robot manipulation. Finally, we illustrate how PIC can be applied to manipulate object environment to conform to a set of object relations such as "red bowl on top of blue mu Fig. 2 (d ). 
We use the combination of the Model Predictive Control (MPC) (Williams and the World Model as the generator. At each time step, we first use MPC to sample a se actions and then render the state images (after executing an action) from multiple camera the world model. For each action, the scorer computes a summed score across all camera final score, which is used to select the best action to execute. For the generator, we assume that there is a pre-trained model, i.e. world model, that ca render and simulate the dynamic changes in the robot world. Since such a large pre-tr does not directly exist, we approximate it using an environment simulator combined with generator. For the scorer, we use the pre-trained ViLD (Gu et al., 2021) to generate s maps for images captured by different camera views, and the corresponding text la segment, which are used to obtain object relations. We compare the generated object r the relations specified by the text description to obtain the scorer, i.e. score equals 0 if otherwise, 1 (here the score means the distance). To obtain a final world state xT that specified relations, and the action sequence {a1, • • • , aT } that manipulates the objects state xT , the generator iteratively samples possible actions âi t+1 and gets feedback from best action is selected by: at+1 = arg min ât+1 N X n=1 E n ✓ (xt, ât+1 ). Each scorer, E n ✓ , outputs a score for the resultant state obtained when a candidate ac applied to the current world state xt. We execute at+1 in the environment and get a ne We repeat this process until the task is accomplished or we are at the final step T .

4. EXPERIMENT SETUP

We evaluate the proposed framework for composing pre-trained models on four representative tasks: image generation, video question answering, grade school math, and robot manipulation.

Image generation. We first show that composing the pre-trained image generation model and scorer models such as CLIP enables effective zero-shot image generation. We evaluate the image generation results on ImageNet (Deng et al., 2009) at an image resolution of 64 × 64. The class labels are used as text input to guide image generation. Each method generates 50 images for each class. We evaluate the image generation quality using Inception Score (IS) (Salimans et al., 2016), Fréchet Inception Distance (FID) (Heusel et al., 2017), and Kernel Inception Distance (KID) (Bińkowski et al., 2018). IS measures the distribution of generated images; higher values mean the models can generate more distinct images. FID considers both the distribution of generated images and the distribution of real images; lower scores indicate that the generated images are closer to the real images. KID is similar to FID, measuring the similarity between two data distributions but in the kernel space.

Video question answering. We evaluate methods for solving VQA tasks on ActivityNet-QA (Yu et al., 2019). Our method generates free-form language answers instead of selecting an answer from a pre-defined answer set (Yang et al., 2021; Lei et al., 2022). To evaluate such free-form VQA, we ask workers from Amazon Mechanical Turk to judge whether the generated answer matches the given question and video (see Appendix B for IRB approval and experimental details).
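To make the FID metric used for image generation concrete, the following sketch computes the Fréchet distance between Gaussians fitted to two feature sets, $\|\mu_a - \mu_b\|^2 + \mathrm{Tr}(\Sigma_a + \Sigma_b - 2(\Sigma_a \Sigma_b)^{1/2})$. It is an illustration only: actual FID uses Inception network activations as features, whereas the random arrays below are placeholders, and the function name `frechet_distance` is ours.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Placeholder features standing in for Inception activations.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
shifted = real + np.array([1.0, 2.0, 0.0, 0.0])  # same covariance, shifted mean

same_distance = frechet_distance(real, real)
shift_distance = frechet_distance(real, shifted)
```

Identical feature sets give a distance of (numerically) zero, and shifting every feature vector by a constant offset leaves the covariance unchanged, so the distance reduces to the squared norm of the offset. This matches the reading above that lower values mean the two distributions are closer.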


We cascade the captions of multiple video frames and questions about this video to prompt GPT-3 for video question answering.

Grade school math. We further apply PIC to solve grade school math problems. We use GPT-2 as the generator and treat the grade school math problem as a text generation problem. The scorer, a pre-trained question-solution classifier, provides the generator feedback to guide the next token's generation $x_{t+1}$. We follow the approach used in VQA to iteratively optimize the generations based on the feedback from scorers. Our generator G first generates a set of candidate words $\{x^i_{t+1}\}$, and then the classifier predicts the probability of each solution (the concatenation of previous words and each new word, $\{x_1, x_2, \cdots, x^i_{t+1}\}$) matching the given question. The classifier score is the cross-entropy loss between this new probability distribution and the original distribution of the next word obtained from the generator G. The gradient of the classifier score is used to update $C_t$ through iterative refinement. The updated $C_t$ is used to predict the next word $x_{t+1} = G(x_t, C_t)$. We repeat this process until we generate the complete solution.

Robot manipulation. Finally, we illustrate how PIC can be applied to manipulate objects in the robot environment to conform to a set of object relations such as "red bowl on top of blue mug" shown in Fig. 2(d). We use the combination of Model Predictive Control (MPC) (Williams et al., 2015) and the World Model as the generator, and the pre-trained ViLD (Gu et al., 2021) as the scorer, which compares the object relations observed from multiple camera views with the specified relations (see Appendix A.4 for details).


Figure A7: Overview of robot manipulation. We use MPC+World Model as the generator and ViLD as the scorer to manipulate objects to conform to a set of object relations specified by text descriptions or real-world images. Top: Given a real-world image, we first use ViLD to generate a 2D segmentation of the real-world image and the corresponding text label, such as "mug", for each segment. We then use the relative pixel-wise offsets of segmentation masks and the text labels to infer a set of object relations. Bottom: Given the current world state $x_t$, we aim to generate an action $a_{t+1}$ so that the new world state after executing $a_{t+1}$ has object relations closer to the object relations in the given image. To do this, we first use the generator (MPC+World Model) to generate a set of candidate actions $\{\hat{a}^k_{t+1}\}$ and the corresponding world states $\{x^k_{t+1}\}$ after executing each candidate action. For each new world state $x^k_{t+1}$, we render N 2D images from N camera views. Each rendered image is sent to ViLD to get a segmentation map and text labels. We project the objects into 3D space based on the segmentation map and the depth map of the image. We then obtain the object relations based on their 3D positions and the predicted text labels. We compare the object relations obtained from each rendered image with the object relations obtained from the real-world image to compute the score. The score is 0 if the relations match and 1 otherwise. We sum the scores from each rendered image to obtain the final score. We choose the action $a_{t+1}$ that leads to a world state with the minimum summed score. We execute $a_{t+1}$ in the environment and get a new state $x_{t+1}$. We repeat this process until the task is accomplished or we are at the final step $T$, where $T$ equals the number of relations extracted from the real-world image.

Workers are shown a video, three questions, and the answer to each question. The answers are generated by different methods. The workers are not told which method generates each answer. The workers are asked to select "yes" or "no" based on their judgment of whether the answer is correct for the given video and question.



By zero-shot, we mean the composed models are never trained together on the evaluation task.



Figure 1: The proposed framework that composes a "generator" and an ensemble of "scorers" through iterative consensus enables zero-shot generalization across a variety of multimodal tasks.

Figure 2: The proposed unified framework and examples on three representative tasks. (a) Overview of the proposed unified framework. Dashed lines are omitted for certain tasks. (b) Image generation. A pre-trained diffusion model is used as the generator, and multiple scorers, such as CLIP and image classifiers, are used to provide feedback to the generator. (c) Video question answering. GPT-2 is used as the generator, and a set of CLIP models are used as scorers. (d) Robot manipulation. MPC+World model is used as the generator, and a pre-trained image segmentation model is used to compute the scores from multiple camera views to select the best action. Orange lines represent the components used to refine the generated result.

Figure 3: Video question answering example results. Our approach successfully identifies gender and clothing, but its failure to count objects is a reflection of GPT-2 and CLIP's inability to count.

Figure 4: Robot manipulation example results. The robot manipulates objects to achieve certain object relations that are specified by textual descriptions (first row) or real-world images (second row).

Figure A1: Qualitative results. Image generation results using different scorer models. Composing multiple scorers (CLS-FREE + CLS + CLIP) achieves the best performance.


To update $C_t$, we first use G to generate a set of candidate words $X_{t+1} = \{x_{t+1}\}$, and then use the feature distance (after softmax) between each sentence (the concatenation of previous words and each new word, $\{x_1, x_2, \cdots, x_{t+1}\}$, where $x_{t+1} \in X_{t+1}$) and the video frame as the probability of them matching. The CLIP score is the cross-entropy loss $L_{\text{CLIP}}$ between this new probability distribution and the original distribution of the next word obtained from the generator G. The gradient of the summed score (multiple CLIP models) is then propagated to G to update $C_t$.
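The scoring step described above can be illustrated with a small numerical sketch. It is not the authors' code and rests on assumptions: the per-candidate similarity values stand in for CLIP text-frame feature similarities, the candidate words are invented, the scorer distribution is assumed to be the cross-entropy target, and the gradient step on $C_t$ is omitted. The sketch only shows how the summed loss over several scorers is formed and that it is lower when the generator's next-word distribution agrees with the scorers.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of scores."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


def scorer_loss(similarities, generator_probs):
    """Cross-entropy between a scorer's matching distribution and the
    generator's next-word distribution (scorer distribution as target)."""
    target = softmax(similarities)
    q = np.asarray(generator_probs, dtype=float)
    return float(-(target * np.log(q + 1e-12)).sum())


def summed_loss(similarities_per_scorer, generator_probs):
    """Sum the loss over an ensemble of scorers."""
    return sum(scorer_loss(s, generator_probs) for s in similarities_per_scorer)


# Hypothetical candidate words and scorer similarities (illustrative values).
candidate_words = ["dog", "cat", "car"]
generator_probs = np.array([0.2, 0.5, 0.3])
clip_a = [4.0, 1.0, 0.5]
clip_b = [3.0, 2.5, 0.0]

loss_before = summed_loss([clip_a, clip_b], generator_probs)

# The summed cross-entropy is minimized when the generator distribution
# equals the average of the scorers' distributions.
agreeing_probs = (softmax(clip_a) + softmax(clip_b)) / 2.0
loss_after = summed_loss([clip_a, clip_b], agreeing_probs)
```

In the method itself the loss is reduced by gradient updates to the context $C_t$ rather than by setting the distribution directly. The comparison here only indicates the direction in which such updates push the generator.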


Figure A3: Overview of image generation. We use the reverse diffusion process of GLIDE (Nichol et al., 2021), a text-guided diffusion model, as the generator to generate image proposals. At each step of the diffusion process (corresponding to a step of the iterative refinement), we use the gradient from an ensemble of scorers, such as CLIP (Radford et al., 2021), to guide and update the generated proposals. The image $x_k$ generated at iteration $k$ is first sent to the diffusion model to generate an image proposal $x_{k+1}$. The scorers provide feedback to refine the generated result. The CLIP model computes the cosine similarity between the image and text features as the score. The image classifier (Dhariwal & Nichol, 2021) predicts the probability of the image matching the text label as the score. The scores generated by different scorers are summed, and their gradient with respect to $x_k$ is used to compute the next reverse prediction $x_{k+1}$. Classifier-free guidance (Ho & Salimans, 2022) can be treated as an implicit classifier that directly provides pixel-wise gradient feedback to the generated image. We iteratively repeat this procedure until the final step. Our framework enables the use of ensembles of different pre-trained models as scorers, significantly improving the zero-shot results by leveraging the strengths of multiple expert models.
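The guidance loop in this caption can be mimicked on a toy problem. The sketch below is an assumption-laden illustration, not GLIDE or CLIP: each scorer is a hypothetical quadratic function with an analytic gradient, the generator step is an identity placeholder for one reverse diffusion step, and `step_size` and `num_steps` are arbitrary. It shows only the structure: propose, sum the scorer gradients at the current sample, and update.

```python
import numpy as np


def make_quadratic_scorer(center):
    """Hypothetical scorer: higher score the closer x is to `center`."""
    center = np.asarray(center, dtype=float)

    def score(x):
        return -float(np.sum((x - center) ** 2))

    def grad(x):
        return -2.0 * (x - center)

    return score, grad


def guided_refinement(x0, generator_step, scorer_grads, step_size=0.1, num_steps=50):
    """Iteratively refine x using the summed gradient of all scorers."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        proposal = generator_step(x)                # generator proposes the next sample
        guidance = sum(g(x) for g in scorer_grads)  # summed scorer gradient at x
        x = proposal + step_size * guidance         # guided update
    return x


centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
scorers = [make_quadratic_scorer(c) for c in centers]
x_start = np.array([5.0, -3.0])
x_final = guided_refinement(
    x0=x_start,
    generator_step=lambda x: x,  # placeholder for a reverse diffusion step
    scorer_grads=[grad for _, grad in scorers],
)
```

With two scorers that prefer different points, the summed gradient drives the sample to a compromise between them (here, the midpoint of the two centers), which is one simple picture of the consensus among scorers.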


Q: how many people are there in the video # A: 2 # Q: what is behind the person in white clothes # A: tree # Q: what is in front of the person with braid # A: chair ... # Q: what is the person in white doing # A: tie hair # Q: what happened to the person in gray after he threw a goal # A: clap with your teammates # Summarize the following descriptions and answer the question as shown above: a Video showing the new Hair tutorial; a video showing young blond hair clip attaching to top pony tail of teens hair; …; a video on the head hair clip website showing blonde long hair twisted in two knots. # Q: is the person with a golden hair long hair

Figure A5: Prompt given to GPT-3 for video question answering. Text in black contains the question-answer pairs randomly sampled from the ActivityNet-QA training dataset. Text in blue has the video frame captions generated by the proposed method. Text in orange is the question about this video that needs to be answered.


Figure A9: Screenshot of Amazon Mechanical Turk we used for the video question answering experiment.

Image generation results on ImageNet. Our PIC can compose the pre-trained generator (G) and scorers (E) through iterative optimization. Composing multiple scorers further boosts performance.

Video question answering results on ActivityNet-QA. JustAsk (FT) is finetuned on ActivityNet-QA, thus achieving the best results. For zero-shot VQA, our method (PIC) significantly outperforms JustAsk (Pretrain), one of the best VQA methods. Using multiple scorers further improves the performance.

Grade school math results on GSM8K.

Robot manipulation results on Ravens. PIC can manipulate objects to achieve object relations specified by textual descriptions (Text) or real-world images (Image). Using scorers of multiple camera views substantially improves the success rate.


Effect of iterative refinement. Grade school math results on GSM8K. PIC with iterative refinement outperforms baselines where the scorer only provides feedback to the generator at the end stage. BS is the beam search size.


Our framework can be applied to other generators as well. In the image generation task, we use a new generator, Stable-Diffusion (Rombach et al., 2021). Using a more powerful pre-trained model can further boost the performance. Image generation results on ImageNet are reported.

Melanie is a door-to-door saleswoman. She sold a third of her vacuum cleaners at the green house, 2 more to the red house, and half of what was left at the orange house. If Melanie has 5 vacuum cleaners left, how many did she start with?

A fog bank rolls in from the ocean to cover a city. It takes 10 minutes to cover every 3 miles of the city. If the city is 42 miles across from the oceanfront to the opposite inland edge, how many minutes will it take for the fog bank to cover the whole city?
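For reference, the two word problems above can be worked through directly. The short script below is our own arithmetic check, not output from PIC or any baseline; the function names are illustrative. The first problem is solved by working backwards from the 5 vacuum cleaners left, the second by counting 3-mile intervals.

```python
from fractions import Fraction


def melanie_start(left: int = 5) -> Fraction:
    """Work backwards from the vacuum cleaners left at the end."""
    before_orange = left * 2        # half of what was left was sold at the orange house
    before_red = before_orange + 2  # 2 were sold at the red house
    # A third was sold at the green house, so two thirds remained afterwards.
    return Fraction(before_red) / Fraction(2, 3)


def fog_minutes(city_miles: int = 42, miles_per_step: int = 3, minutes_per_step: int = 10) -> Fraction:
    """Minutes for the fog to cover the city at 10 minutes per 3 miles."""
    return Fraction(city_miles, miles_per_step) * minutes_per_step


start = melanie_start()   # 18 vacuum cleaners
minutes = fog_minutes()   # 140 minutes
```

Checking the first answer forwards: starting from 18, Melanie sells 6 (leaving 12), then 2 (leaving 10), then half of the rest (leaving 5), which matches the problem statement.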

Different ways to compose scorer models. Composing scorers using their summed score generates the best results. Image generation results on ImageNet are reported.

Comparison of our method and additional baselines. Image generation results on ImageNet are reported. Our method outperforms the baseline.

Overview of solving grade school math problems. We use GPT-2 as the generator and treat the grade school math problem as a text generation problem. The scorer, a pre-trained question-solution classifier, provides the generator feedback to guide the next token's generation $x_{t+1}$. We follow the approach used in VQA to iteratively optimize the generations based on the feedback from scorers. Our generator G first generates a set of candidate words $X_{t+1} = \{x_{t+1}\}$, and then the classifier predicts the probability of each solution (the concatenation of previous words and each new word, $\{x_1, x_2, \cdots, x_{t+1}\}$, where $x_{t+1} \in X_{t+1}$) matching the given question. The classifier score is the cross-entropy loss between this new probability distribution and the original distribution of the next word obtained from the generator G. The gradient of the classifier score is used to update $C_t$ through iterative refinement (see Equation 5).


Published as a conference paper at ICLR 2023 

B.5 A UNIFIED FRAMEWORK FOR COMPOSING PRE-TRAINED MODELS

Our method shares some architectural similarities with existing works, such as ZeroCap (Tewel et al., 2021) and CLIP-guided diffusion models (Nichol et al., 2021). However, the focus of our paper is to propose a general framework for composing different pre-trained models across a variety of tasks, and these particular methods are concrete instantiations of our proposed framework. In addition, we illustrate how ensembles of different pre-trained models can be combined as scorers to leverage the "wisdom of the crowds", where each scorer provides complementary feedback to the generator, compensating for the potential weaknesses of other scorers. Through iterative optimization and the composition of multiple scorers, our method shows effective zero-shot generalization ability on various multimodal tasks.

C ETHICS STATEMENT OF AMAZON MECHANICAL TURK EXPERIMENTS

To evaluate approaches for solving the zero-shot video question answering tasks, we ask workers from Amazon Mechanical Turk to evaluate the generated answer based on the video and the asked question. Before showing the questions and answers to the workers, we used the profanity check tool (https://github.com/vzhou842/profanity-check) to remove improper questions and answers. As shown in Fig. A8, this experiment was approved by the Committee on the Use of Humans as Experimental Subjects. A screenshot of the task is shown in Fig. A9. The instructions shown to participants are listed as follows:

Instructions: By making judgments about these questions and answers, you are participating in a study being performed by [XXX]. Your participation in this research is voluntary. You may decline further participation, at any time, without adverse consequences. Your anonymity is assured; the researchers who have requested your participation will not receive any personal information about you.

Given a video, a question, and a generated answer, the workers from Amazon Mechanical Turk judge whether the answer is correct for the given question and video. Each video shows three question-answer pairs (only one question-answer pair is shown in the screenshot). The answers are generated by different methods. The workers are not told which method generates each answer. The workers are asked to choose "yes" or "no". If the worker thinks the answer matches the given video and question, they should choose "yes"; otherwise, "no".

To control the quality, each task is evaluated by three different workers. The workers are required to have an approval rate greater than 98%. Our test shows that each task takes around 10 seconds, but the workers are given up to one hour to complete each task. The workers are paid $0.05 for finishing each task, with an estimated hourly payment of $18, more than the United States federal minimum wage. In total, 33 workers joined our experiment.

