MID-VISION FEEDBACK

Abstract

Feedback plays a prominent role in biological vision, where perception is modulated by an agent's evolving expectations and world model. We introduce Mid-Vision Feedback (MVF), a novel mechanism which modulates perception based on high-level categorical expectations. MVF associates high-level contexts with linear transformations. When a context is "expected", its associated linear transformation is applied to feature vectors in a mid-level layer of a network. As a result, mid-level network representations are biased towards conformance with high-level expectations, improving overall accuracy and contextual consistency. Additionally, during training, mid-level feature vectors are biased through the introduction of a loss term which increases the distance between feature vectors associated with different contexts. MVF is agnostic as to the source of contextual expectations, and can serve as a mechanism for top-down integration of symbolic systems with deep vision architectures. We show that MVF outperforms post-hoc filtering for the incorporation of contextual knowledge, and that configurations using predicted context (when no context is known a priori) outperform configurations with no context awareness.

1. INTRODUCTION

In most contemporary computer vision architectures, information flows in a single direction: from the low level of pixels up to high-level abstract concepts (e.g., object categories). Such architectures are termed feed-forward architectures. In general, each successive layer of the network contains more abstract representations than the previous, and the representational hierarchy mirrors the architectural hierarchy. It is also possible to introduce top-down connections into the network architecture, injecting high-level information into processes at lower levels of abstraction in a process of feedback. Feedback plays a primary role in biological vision; in fact, the majority of neural connections in the visual cortex are top-down, rather than bottom-up, connections (Markov et al., 2014). These top-down connections are thought to convey higher-level expectations, and neurons of the visual cortex use both higher-level expectation and lower-level visual information in producing their representations. Expectations in biological systems arise from continuous engagement with the environment. In computer vision, this is reflected in the paradigm of Active Vision (Bajcsy, 1988; Fermüller & Aloimonos, 1995), where perception is framed as an active problem involving evolving world models.

The task of producing mid-level visual representations (Teo et al., 2015a;b; Xu et al., 2012; Nishigaki et al., 2012) from low-level input is under-constrained: many plausible mid-level interpretations may be consistent with the input. For an intuition of how understanding of context can impact perception of mid-level features, consider Figure 1: characteristics of shrews and kiwis differ, but may be similar enough to be confused without context. Top-down feedback, from high-level context to mid-level visual features, provides a "map" for mid-level processing, constraining it towards high-level consistency.

Figure 1: Here we illustrate images cropped to exclude context.
At first glance, due to similarities in color, texture, and pattern, images from the top row (animate) may appear to be of the same class as those of the bottom row (inanimate). With an understanding of the difference in context, upon closer inspection it is clear that there are meaningful lower-level feature differences.

Introduction of contextual knowledge through feedback is superior to post-hoc application of contextual knowledge, e.g., through discarding interpretations (here, classifications) which are not context-consistent. We demonstrate this point empirically. Interpretations selected after post-hoc filtering for context consistency are still built upon under-constrained mid-level features. Furthermore, in contrast to post-hoc filtering, feedback naturally allows for detection of out-of-context objects, as feedback functions through biasing of visual representations rather than filtering. It is valuable for methods to allow for out-of-context detections, even while biasing against them, as out-of-context objects do on occasion appear (e.g., a tree in an office setting).

Figure 2: Artist illustration of decoupled representations, as may be learned by a CNN. For illustration we annotate characteristics with visually recognizable categories, though in this work we exploit the tendency towards decoupled representations at lower levels. Feature vector angle corresponds to characteristic type, while feature vector magnitude corresponds to within-characteristic variation, or degree. Liu et al. (2018) observe that CNNs produce decoupled representations, and derive a similar illustration over MNIST by setting a convolution operator's dimension to 2.
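The contrast with post-hoc filtering can be made concrete with a small sketch. This is an illustrative simplification, not the paper's implementation; the class partition, logit values, and function name are invented for the example. The point it shows: masking class scores after the forward pass can never surface an out-of-context object, whereas feedback only biases representations, leaving such detections possible.

```python
import numpy as np

def posthoc_filter(logits, context_classes):
    """Post-hoc baseline: mask classes outside the expected context to
    -inf, then take the argmax of what remains.  By construction this
    can never return an out-of-context class, no matter how strong the
    visual evidence for one is."""
    masked = np.full_like(logits, -np.inf)
    masked[context_classes] = logits[context_classes]
    return int(np.argmax(masked))

# Hypothetical 4-class problem: classes 0-1 are "animate", 2-3 "inanimate".
# The evidence strongly favors class 2, but the expected context is animate.
logits = np.array([1.0, 0.5, 3.0, 0.2])
prediction = posthoc_filter(logits, [0, 1])  # class 2 is unreachable
```

Note that the filtered prediction falls back to the best in-context class (0 here), even though class 2 dominates the raw logits.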
CNNs have a natural tendency towards decoupled representations: representations in which feature vector angle tends to correspond to characteristic type (e.g., "fuzzy"), and feature vector magnitude to within-characteristic variation or degree (e.g., "very fuzzy" / "not fuzzy") (Liu et al., 2018) (see Figure 2 for an illustration). This opens up two possibilities for directly manipulating feature representations: 1) we can differentiate between axes with different associations to high-level contexts, and 2) we can control the magnitudes of characteristics by amplifying or dampening the axes associated with those characteristics. That is, w.r.t. point 1, as CNNs produce representations which are, to a degree, separated by angle, certain axes will be more associated with some higher-level contexts than with others. W.r.t. point 2, amplifying characteristics associated with a higher-level context increases the likelihood of interpreting input as conforming to that context; dampening characteristics associated with that context decreases that likelihood. We present a principled method for feedback, Mid-Vision Feedback (MVF), illustrated in Figure 3, which biases mid-level feature representations in networks such as CNNs towards conformance with high-level categorical expectations. The approach comprises two components: 1) linear transforms (affine transformations), and 2) an orthogonalization bias.
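The two components can be sketched as follows. This is a minimal NumPy illustration under the decoupled-representation assumption above, with an invented 2-D feature space and a hand-built diagonal transform, not the trained transforms of MVF: a context-specific affine transform amplifies the axes associated with the expected context and dampens the others, while an orthogonalization-style loss penalizes angular overlap between feature vectors from different contexts.

```python
import numpy as np

def apply_feedback(features, W, b):
    """Bias mid-level features towards an expected context by applying
    that context's affine transform (W, b) to each feature vector."""
    return features @ W.T + b

def orthogonalization_loss(feats_a, feats_b):
    """Mean squared pairwise cosine similarity between feature vectors
    from two contexts; driving it to zero pushes the contexts towards
    orthogonal axes, i.e. separation by angle."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    return float(np.mean((a @ b.T) ** 2))

# Toy 2-D feature space: axis 0 ~ "animate" characteristics, axis 1 ~ "inanimate".
animate_W = np.diag([2.0, 0.5])   # amplify axis 0, dampen axis 1
animate_b = np.zeros(2)

feats = np.array([[1.0, 1.0],
                  [0.5, 2.0]])
biased = apply_feedback(feats, animate_W, animate_b)
# axis 0 is doubled and axis 1 halved: [[2.0, 0.5], [1.0, 1.0]]

# Perfectly decoupled contexts incur zero loss; identical directions incur 1.
animate_feats = np.array([[1.0, 0.0], [2.0, 0.0]])
inanimate_feats = np.array([[0.0, 1.0], [0.0, 3.0]])
```

At inference, the transform associated with the expected (or predicted) context would be applied at a mid-level layer; during training, the orthogonalization term would be added to the task loss to encourage angular separation between contexts.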



Code will be available at: https://github.com/maynord/Mid-Vision-Feedback




