DIFFERENTIABLE MATHEMATICAL PROGRAMMING FOR OBJECT-CENTRIC REPRESENTATION LEARNING

Abstract

We propose topology-aware feature partitioning of given scene features into k disjoint partitions as a method for object-centric representation learning. To this end, we use minimum s-t graph cuts, which can be represented as linear programs, as the partitioning method. The method is topology-aware since it explicitly encodes neighborhood relationships in the image graph. To solve the graph cuts, our solution relies on an efficient, scalable, and differentiable quadratic programming approximation. Optimizations specific to cut problems allow us to solve the quadratic programs and compute their gradients significantly more efficiently than the general quadratic programming approach. Our results show that our approach is scalable and outperforms existing methods on object discovery tasks with textured scenes and objects.

1. INTRODUCTION

Object-centric representation learning aims to learn representations of individual objects in scenes given as static images or video. Object-centric representations can potentially generalize across a range of computer vision tasks by embracing the compositionality inherent in visual scenes, which arises from the interaction of mostly independent entities (Burgess et al., 2019; Locatello et al., 2020; Elsayed et al., 2022).

One way to formalize object-centric representation learning is to consider it as an input partitioning problem. Here we are given a set of spatial scene features, and we want to partition them into k per-object features, or slots, for some given number of objects k. A useful requirement for a partitioning scheme is that it should be topology-aware. For example, the partitioning scheme should be aware that points close together in space are often related and may form part of the same object. A related problem is to match object representations in two closely related scenes, such as frames in a video, to learn object permanence across space and time.

In this paper we focus on differentiable solutions for the partitioning and matching problems that are also efficient and scalable for object-centric learning. We formulate the topology-aware k-part partitioning problem as the problem of solving k minimum s-t cuts in the image graph (see Figure 1) and the matching problem as a bipartite matching problem. An interesting feature of the minimum s-t cut and bipartite matching problems is that both can be formulated as linear programs. We can include such programs as layers in a neural network by parameterizing the coefficients of the objective function of the linear program with neural networks. However, linear programs by themselves are not continuously differentiable with respect to the objective function coefficients (Wilder et al., 2019).
A greater problem is that solving batches of linear programs with existing solvers is too inefficient for neural network models, especially when the programs have a large number of variables and constraints. We address both problems by 1) approximating linear programs by regularized, equality-constrained quadratic programs, and 2) precomputing factorizations of the optimality condition (KKT) matrix so that the optimality equations can be solved quickly during training. The advantage of equality-constrained quadratic programs is that they can be solved directly from their optimality conditions, which form a linear system. Combined with precomputed factorizations suited to the task of object-centric learning, these optimality conditions can be solved very efficiently during training.
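
To make the idea of solving from precomputed optimality conditions concrete, the following is a minimal sketch and not the implementation used in this paper. It assumes a quadratic regularizer of the form (gamma/2)||x||^2, a dense constraint matrix A with full row rank, and a generic LU factorization from SciPy. The class name EqualityQP, the parameter gamma, and the toy constraint are hypothetical choices made for illustration. Because the KKT matrix depends only on A and gamma, and not on the learned cost vector c, it is factorized once and reused for both the forward solve and the gradient with respect to c.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve


class EqualityQP:
    """Sketch: min_x (gamma/2)||x||^2 + c^T x  subject to  A x = b.

    A, b and gamma are fixed; only the cost vector c changes between solves.
    (Illustrative assumption, not the paper's exact formulation.)
    """

    def __init__(self, A, b, gamma=0.1):
        A = np.asarray(A, dtype=float)
        m, n = A.shape
        self.n = n
        self.b = np.asarray(b, dtype=float)
        # KKT matrix [[gamma*I, A^T], [A, 0]] is independent of c,
        # so its factorization is computed a single time.
        K = np.block([[gamma * np.eye(n), A.T],
                      [A, np.zeros((m, m))]])
        self.lu = lu_factor(K)

    def solve(self, c):
        # Optimality conditions: gamma*x + c + A^T nu = 0 and A x = b.
        rhs = np.concatenate([-np.asarray(c, dtype=float), self.b])
        return lu_solve(self.lu, rhs)[: self.n]

    def grad_c(self, grad_x):
        # x is linear in c and K is symmetric, so the gradient of a loss
        # with respect to c is minus the x-block of K^{-1} [grad_x; 0].
        rhs = np.concatenate([np.asarray(grad_x, dtype=float),
                              np.zeros_like(self.b)])
        return -lu_solve(self.lu, rhs)[: self.n]


# Toy usage: four variables constrained to sum to one.
A = np.ones((1, 4))
b = np.array([1.0])
qp = EqualityQP(A, b, gamma=0.5)
c = np.array([0.3, -0.2, 0.1, 0.4])
x = qp.solve(c)
g = qp.grad_c(np.array([1.0, 0.0, 0.0, 0.0]))
```

In this sketch each new cost vector costs only a pair of triangular solves, which is the property that makes batched training cheap. A cut-specific implementation would presumably exploit the sparsity and structure of the image graph rather than a dense LU factorization.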

