DIFFERENTIABLE MATHEMATICAL PROGRAMMING FOR OBJECT-CENTRIC REPRESENTATION LEARNING

Abstract

We propose topology-aware partitioning of scene features into k disjoint partitions as a method for object-centric representation learning. To this end, we propose minimum s-t graph cuts as the partitioning method, represented as a linear program. The method is topology-aware since it explicitly encodes neighborhood relationships in the image graph. To solve the graph cuts, our solution relies on an efficient, scalable, and differentiable quadratic programming approximation. Optimizations specific to cut problems allow us to solve the quadratic programs and compute their gradients significantly more efficiently than with the general quadratic programming approach. Our results show that our approach is scalable and outperforms existing methods on object discovery tasks with textured scenes and objects.

1. INTRODUCTION

Object-centric representation learning aims to learn representations of individual objects in scenes given as static images or video. Object-centric representations can potentially generalize across a range of computer vision tasks by embracing the compositionality inherent in visual scenes arising from the interaction of mostly independent entities (Burgess et al., 2019; Locatello et al., 2020; Elsayed et al., 2022). One way to formalize object-centric representation learning is to consider it as an input partitioning problem. Here we are given a set of spatial scene features, and we want to partition the given features into k per-object features, or slots, for some given number of objects k. A useful requirement for a partitioning scheme is that it should be topology-aware. For example, the partitioning scheme should be aware that points close together in space are often related and may form part of the same object. A related problem is to match object representations in two closely related scenes, such as frames in video, to learn object permanence across space and time. In this paper we focus on differentiable solutions for the partitioning and matching problems that are also efficient and scalable for object-centric learning. We formulate the topology-aware k-part partitioning problem as the problem of solving k minimum s-t cuts in the image graph (see Figure 1) and the problem of matching as a bipartite matching problem. An interesting feature of the minimum s-t cut and bipartite matching problems is that they can both be formulated as linear programs. We can include such programs as layers in a neural network by parameterizing the coefficients of the objective function of the linear program with neural networks. However, linear programs by themselves are not continuously differentiable with respect to the objective function coefficients (Wilder et al., 2019).
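For reference, the textbook linear programming relaxation of the minimum s-t cut on a graph G = (V, E) with edge weights w_{uv} (a standard formulation from the max-flow/min-cut literature; the precise program used in this paper may differ in its regularization) reads:

```latex
\begin{aligned}
\min_{d,\, p} \quad & \sum_{(u,v)\in E} w_{uv}\, d_{uv} \\
\text{s.t.} \quad & d_{uv} \ge p_u - p_v \quad \forall (u,v)\in E, \\
& p_s = 1, \quad p_t = 0, \\
& d_{uv} \ge 0, \quad p_v \ge 0 \quad \forall v \in V .
\end{aligned}
```

Here p_v indicates which side of the cut vertex v lies on, and d_{uv} indicates whether edge (u, v) crosses the cut; at an integral optimum the objective equals the total weight of the cut edges.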
A greater problem is that batch solution of linear programs using existing solvers is too inefficient for neural network models, especially when the programs have a large number of variables and constraints. We solve these problems by 1) approximating linear programs by regularized equality constrained quadratic programs, and 2) precomputing the optimality condition (KKT matrix) factorizations so that optimality equations can be quickly solved during training. The advantage of using equality constrained quadratic programs is that they can be solved simply from the optimality conditions. Combined with the appropriate precomputed factorizations for the task of object-centric learning, the optimality conditions can be solved very efficiently during training.

Algorithm 1 Feature k-Part Partitioning
Require: Input features x of dimension C × H × W with C channels, height H and width W
1: Compute quadratic program parameters y_i = f_y^i(x), for i ∈ {1, . . ., k}, where the f_y^i are CNNs.
2: Optionally transform spatial features x_f = f_x(x), where f_x is an MLP transform acting on the channel dimension. x_f has dimension D × H × W.
3: Solve the regularized quadratic programs for minimum s-t cut and extract vertex variables z_i = qsolve(y_i) for each y_i. Each z_i has dimension H × W.
4: Normalize the z_i across cuts i = 1, . . ., k for each pixel with a temperature-scaled softmax.
5: Multiply z_i with x_f along H, W for each i to obtain k masked feature maps r_i.
6: Return the r_i as the k-partition.

A second advantage of using quadratic programming approximations is that quadratic programs can be differentiated with respect to the program parameters using the implicit function theorem, as shown in prior literature (Barratt, 2018; Amos and Kolter, 2017). To learn the objective coefficients of the cut problem by a neural network, the linear program needs to be solved differentiably, like a hidden layer.
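The data flow of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: a random 1×1 channel mixing stands in for the learned CNNs f_y^i, the feature transform f_x is the identity, and the regularized QP solve of step 3 is replaced by a pass-through placeholder, so only the masking and normalization structure (steps 1-6) is shown.

```python
import numpy as np

def softmax(a, axis=0, temp=1.0):
    # Temperature-scaled softmax, numerically stabilized.
    a = a / temp
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def k_part_partition(x, k, temp=0.5, seed=0):
    """Sketch of Algorithm 1 on features x of shape (C, H, W)."""
    rng = np.random.default_rng(seed)
    C, H, W = x.shape
    # Step 1: per-cut program parameters y_i = f_y^i(x). A random
    # channel mixing stands in for the learned CNNs.
    Wy = rng.standard_normal((k, C))
    y = np.einsum('kc,chw->khw', Wy, x)          # (k, H, W)
    # Step 2: optional transform x_f = f_x(x); identity here.
    x_f = x
    # Step 3: placeholder for qsolve -- the paper solves a regularized
    # equality-constrained QP for a minimum s-t cut per y_i; here the
    # scores pass through unchanged.
    z = y
    # Step 4: normalize across the k cuts for each pixel.
    z = softmax(z, axis=0, temp=temp)            # (k, H, W)
    # Steps 5-6: mask the features with each z_i and return.
    r = z[:, None] * x_f[None]                   # (k, C, H, W)
    return r, z

r, z = k_part_partition(np.ones((3, 4, 4)), k=5)
```

Because of the per-pixel softmax, the k masks sum to one at every spatial location, so the r_i form a soft partition of the feature map.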
For this, we can relax the linear program to a quadratic program and employ techniques from differentiable mathematical programming (Wilder et al., 2019; Barratt, 2018) to obtain gradients. This amounts to solving the KKT optimality conditions, which then yield the gradients with respect to the parameters of the quadratic program (Amos and Kolter, 2017; Barratt, 2018). However, with a naive relaxation, the required computations for both the forward and backward pass are still too expensive for use in object-centric representation learning applications. Given that the techniques generally employed for differentiably solving quadratic programs are limited to smaller program sizes (Amos and Kolter, 2017), we introduce optimizations in the gradient computation specific to the problem of solving graph cuts for image data. For instance, we note that the underlying s-t flow graph remains unchanged across equally-sized images, allowing us to pre-compute large matrix factorizations. Furthermore, we replace the forward pass by a regularized equality constrained quadratic program constructed from the linear programming formulation of the minimum s-t cut problem. When combined with these task-specific optimizations, equality constrained quadratic programs can be solved significantly more efficiently than general quadratic programs with mixed equality and inequality constraints (Wright and Nocedal, 1999). The regularization of slack variables ensures that the output of the new quadratic program can still be interpreted as an s-t cut solution. The use of sparse matrix computations in the forward and backward passes ensures that time and memory usage is significantly reduced. To summarize, we make the following contributions in this paper:
1. We formulate object-centric representation learning in terms of partitioning and matching.
2. We propose s-t cuts in graphs for topology-aware partitioning with neural networks.
3. We propose regularized equality constrained quadratic programs as a differentiable, general, efficient, and scalable scheme for solving partitioning and matching problems with neural networks.
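The core computational pattern described above, solving an equality-constrained QP from its KKT conditions with a factorization computed once and reused for both the forward solve and the implicit-function-theorem gradient, can be sketched as follows. This is a minimal dense illustration under our own naming (build_kkt, solve_qp, grad_y), not the paper's sparse implementation.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def build_kkt(Q, A):
    """Factor the KKT matrix [[Q, A^T], [A, 0]] once. Q and A are
    fixed across inputs (same image graph), so this is precomputable."""
    n, m = Q.shape[0], A.shape[0]
    K = np.zeros((n + m, n + m))
    K[:n, :n] = Q
    K[:n, n:] = A.T
    K[n:, :n] = A
    return lu_factor(K), n

def solve_qp(kkt, y, b):
    """Forward pass: min_z 1/2 z^T Q z + y^T z  s.t.  A z = b.
    The KKT conditions are Q z + y + A^T nu = 0 and A z = b."""
    factors, n = kkt
    sol = lu_solve(factors, np.concatenate([-y, b]))
    return sol[:n]

def grad_y(kkt, dL_dz):
    """Backward pass via the implicit function theorem: differentiating
    the KKT system in y gives a linear solve against the same
    (symmetric) KKT matrix, so the factorization is reused."""
    factors, n = kkt
    m = factors[0].shape[0] - n
    w = lu_solve(factors, np.concatenate([dL_dz, np.zeros(m)]))
    return -w[:n]

# Tiny instance: minimize 1/2 ||z||^2 + y^T z subject to z1 + z2 = 1.
kkt = build_kkt(np.eye(2), np.array([[1.0, 1.0]]))
z = solve_qp(kkt, np.array([1.0, 0.0]), np.array([1.0]))  # -> [0, 1]
g = grad_y(kkt, np.array([1.0, 0.0]))                     # dL/dy for L = z1
```

In the paper's setting the factorization would be sparse and shared across a batch; the point of the sketch is that both passes reduce to back-substitution against a precomputed factorization.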

2. MINIMUM s-t CUTS FOR TOPOLOGY-AWARE PARTITIONING

We first describe the general formulation of the graph partitioning problem specialized to images, along with its limitations in image settings. With these limitations in mind, we describe the proposed neural s-t cut and matching algorithms, which allow for efficient and scalable graph partitioning and matching. Finally, we describe how to learn end-to-end object-centric representations for static and moving objects with the proposed methods.

2.1. MINIMUM s-t GRAPH CUTS

The problem of finding minimum s-t cuts in graphs is a well-known combinatorial optimization problem closely related to the max-flow problem (Kleinberg and Tardos, 2005) . We are given a directed graph G = (V, E) with weights for edge (u, v) denoted by w u,v and two special vertices

