MULTIVIZ: TOWARDS VISUALIZING AND UNDERSTANDING MULTIMODAL MODELS

Abstract

The promise of multimodal models for real-world applications has inspired research in visualizing and understanding their internal mechanics, with the end goal of empowering stakeholders to visualize model behavior, perform model debugging, and promote trust in machine learning models. However, modern multimodal models are typically black-box neural networks, which makes it challenging to understand their internal mechanics. How can we visualize the internal modeling of multimodal interactions in these models? Our paper fills this gap by proposing MULTIVIZ, a method that analyzes the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages: (1) unimodal importance: how each modality contributes towards downstream modeling and prediction, (2) cross-modal interactions: how different modalities relate with each other, (3) multimodal representations: how unimodal and cross-modal interactions are represented in decision-level features, and (4) multimodal prediction: how decision-level features are composed to make a prediction. MULTIVIZ is designed to operate on diverse modalities, models, tasks, and research areas. Through experiments on 8 trained models across 6 real-world tasks, we show that the complementary stages in MULTIVIZ together enable users to (1) simulate model predictions, (2) assign interpretable concepts to features, (3) perform error analysis on model misclassifications, and (4) use insights from error analysis to debug models. MULTIVIZ is publicly available, will be regularly updated with new interpretation tools and metrics, and welcomes input from the community.

1. INTRODUCTION

The recent promise of multimodal models that integrate information from heterogeneous sources of data has led to their proliferation in numerous real-world settings such as multimedia (Naphade et al., 2006), affective computing (Poria et al., 2017), robotics (Lee et al., 2019), and healthcare (Xu et al., 2019). Subsequently, their impact towards real-world applications has inspired recent research in visualizing and understanding their internal mechanics (Liang et al., 2022; Goyal et al., 2016; Park et al., 2018) as a step towards accurately benchmarking their limitations for more reliable deployment (Hendricks et al., 2018; Jabri et al., 2016). However, modern parameterizations of multimodal models are typically black-box neural networks, such as pretrained transformers (Li et al., 2019; Lu et al., 2019). How can we visualize and understand the internal modeling of multimodal information and interactions in these models? As a step in interpreting multimodal models, this paper introduces an analysis and visualization method called MULTIVIZ (see Figure 1). To tackle the challenges of visualizing model behavior, we scaffold the problem of interpretability into 4 stages: (1) unimodal importance: identifying the contributions of each modality towards downstream modeling and prediction, (2) cross-modal interactions: uncovering the various ways in which different modalities can relate with each other and the types of new information possibly discovered as a result of these relationships, (3) multimodal representations: how unimodal and cross-modal interactions are represented in decision-level features, and (4) multimodal prediction: how decision-level features are composed to make a prediction for a given task.
In addition to incorporating current approaches for unimodal importance (Goyal et al., 2016; Merrick and Taly, 2020; Ribeiro et al., 2016) and cross-modal interactions (Hessel and Lee, 2020; Lyu et al., 2022), we propose new methods for interpreting cross-modal interactions, multimodal representations, and prediction to complete these stages in MULTIVIZ. By viewing multimodal interpretability through the lens of these 4 stages, MULTIVIZ contributes a modular and human-in-the-loop visualization toolkit for the community to visualize popular multimodal

datasets and models as well as compare with other interpretation perspectives, and for stakeholders to understand multimodal models in their research domains. MULTIVIZ is designed to support a wide range of input modalities while operating on diverse models, tasks, and research areas. Through experiments on 6 real-world multimodal tasks (spanning fusion, retrieval, and question answering), 6 modalities, and 8 models, we show that MULTIVIZ helps users gain a deeper understanding of model behavior, as measured via a proxy task of model simulation. We further demonstrate that MULTIVIZ helps human users assign interpretable language concepts to previously uninterpretable features and perform error analysis on model misclassifications. Finally, using takeaways from error analysis, we present a case study of human-in-the-loop model debugging.

Figure 1: Left: We scaffold the problem of multimodal interpretability and propose MULTIVIZ, a comprehensive analysis method encompassing a set of fine-grained analysis stages: (1) unimodal importance identifies the contributions of each modality, (2) cross-modal interactions uncover how different modalities relate with each other and the types of new information possibly discovered as a result of these relationships, (3) multimodal representations study how unimodal and cross-modal interactions are represented in decision-level features, and (4) multimodal prediction studies how these features are composed to make a prediction. Right: We visualize multimodal representations through local and global analysis. Given an input datapoint, local analysis visualizes the unimodal and cross-modal interactions that activate a feature. Global analysis informs the user of similar datapoints that also maximally activate that feature, and is useful in assigning human-interpretable concepts to features by looking at similarly activated input regions (e.g., the concept of color).
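The global analysis described in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under toy assumptions, not the actual MultiViz implementation: `representation` is a made-up stand-in for a trained multimodal encoder, and `top_k_activating` names the retrieval step that finds the datapoints most strongly activating a chosen feature unit.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(7, 16))  # toy encoder weights (illustrative only)

def representation(x1, x2):
    """Stand-in multimodal encoder: fuse modalities into decision-level features."""
    z = np.concatenate([x1, x2]) @ W
    return np.maximum(z, 0.0)  # ReLU activations

def top_k_activating(X1, X2, unit, k=3):
    """Global analysis: indices of the k datapoints that maximally activate `unit`."""
    acts = np.array([representation(X1[i], X2[i])[unit] for i in range(len(X1))])
    return np.argsort(-acts)[:k]  # sorted by descending activation

X1 = rng.normal(size=(10, 4))  # toy "image" features
X2 = rng.normal(size=(10, 3))  # toy "text" features
idx = top_k_activating(X1, X2, unit=5, k=3)
```

A human inspecting the retrieved datapoints `idx` would then look for a shared concept (e.g., "color") across their maximally activated input regions.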
Overall, MULTIVIZ provides a practical toolkit for interpreting multimodal models for human understanding and debugging. MULTIVIZ datasets, models, and code are available at https://github.com/pliang279/MultiViz.

2. MULTIVIZ: VISUALIZING AND UNDERSTANDING MULTIMODAL MODELS

This section presents MULTIVIZ, our proposed framework for analyzing the behavior of multimodal models. As a general setup, we assume multimodal datasets take the form D = {(x_1, x_2, ..., y)}_{i=1}^{n}, with boldface x denoting the entire modality, each x_1, x_2 indicating modality atoms (i.e., fine-grained sub-parts of modalities that we would like to analyze, such as individual words in a sentence, object regions in an image, or time-steps in time-series data), and y denoting the label. These datasets enable us to train a multimodal model ŷ = f(x_1, x_2; θ) which we are interested in visualizing. Modern parameterizations of multimodal models f are typically black-box neural networks, such as multimodal transformers (Hendricks et al., 2021; Tsai et al., 2019) and pretrained models (Li et al., 2019; Lu et al., 2019). How can we visualize and understand the internal modeling of multimodal information and interactions in these models? Having an accurate understanding of their decision-making process would enable us to benchmark their opportunities and limitations for more reliable real-world deployment. However, interpreting f is difficult. In many multimodal problems, it is useful to first scaffold the problem of interpreting f into several intermediate stages from low-level unimodal inputs to high-level predictions, spanning unimodal importance, cross-modal interactions, multimodal representations, and multimodal prediction. Each of these stages provides complementary information on the decision-making process (see Figure 1). We now describe each stage in detail and propose methods to analyze it.
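To make the setup concrete, the following is a minimal sketch in Python of a dataset of (x_1, x_2, y) triples and a multimodal model ŷ = f(x_1, x_2; θ). The late-fusion linear model, data, and all names here are hypothetical toys for illustration, not the MultiViz codebase or any of the models analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multimodal dataset: x1 is a 4-dim "image" feature vector,
# x2 a 3-dim "text" feature vector, y a binary label.
n = 8
X1 = rng.normal(size=(n, 4))
X2 = rng.normal(size=(n, 3))
y = rng.integers(0, 2, size=n)

def f(x1, x2, theta):
    """Toy late-fusion model: concatenate modality atoms into a
    decision-level feature vector, then score it linearly."""
    z = np.concatenate([x1, x2])
    return float(z @ theta)  # scalar logit y_hat

theta = rng.normal(size=7)  # stand-in for learned parameters
logits = [f(X1[i], X2[i], theta) for i in range(n)]  # one prediction per datapoint
```

Real multimodal models replace the concatenation and linear scoring with deep fusion networks, but the interface — inputs per modality, parameters θ, a prediction ŷ — is the one the four interpretation stages below operate on.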

2.1. UNIMODAL IMPORTANCE (U)

Unimodal importance aims to understand the contributions of each modality towards modeling and prediction. It builds upon ideas of gradients (Simonyan et al., 2013; Baehrens et al., 2010; Erhan 
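A minimal sketch of the gradient-based idea behind unimodal importance, in the spirit of the gradient methods cited above: the magnitude of the derivative of the model's output with respect to each input coordinate of a modality serves as a saliency score. Central finite differences stand in here for automatic differentiation, and `f`, `grad_wrt`, and the toy linear model are illustrative names, not the MultiViz API.

```python
import numpy as np

def f(x1, x2, theta):
    """Toy late-fusion model: concatenate both modalities, then score linearly."""
    return float(np.concatenate([x1, x2]) @ theta)

def grad_wrt(f, x1, x2, theta, modality, eps=1e-5):
    """Saliency for one modality: |df/dx| estimated by central finite differences."""
    x = (x1, x2)[modality]
    g = np.zeros_like(x)
    for j in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[j] += eps
        xm[j] -= eps
        if modality == 0:
            g[j] = (f(xp, x2, theta) - f(xm, x2, theta)) / (2 * eps)
        else:
            g[j] = (f(x1, xp, theta) - f(x1, xm, theta)) / (2 * eps)
    return np.abs(g)

theta = np.array([0.5, -2.0, 1.0, 0.0, 3.0, -1.0, 0.25])
x1, x2 = np.ones(4), np.ones(3)
saliency_x1 = grad_wrt(f, x1, x2, theta, modality=0)
# For this linear model, the saliency recovers |theta| on x1's coordinates.
```

For deep models one would use autograd rather than finite differences, but the interpretation is the same: input atoms with larger gradient magnitude contribute more to the prediction.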

