NEARING OR SURPASSING: OVERALL EVALUATION OF HUMAN-MACHINE DYNAMIC VISION ABILITY

Anonymous

Abstract

Dynamic vision ability (DVA), a fundamental function of the human visual system, has been successfully modeled by many computer vision tasks in recent decades. However, these prosperous developments mainly concentrate on using deep neural networks (DNNs) to simulate the human DVA system, while evaluation frameworks still simply compare performance among machines, making it difficult to determine how large the gap between humans and machines is in dynamic vision tasks. Neglecting this issue not only makes it hard to judge the correctness of current research routes, but also prevents truly measuring the DVA intelligence of machines. To answer this question, this work designs a comprehensive evaluation framework based on the 3E paradigm: we carefully pick 87 videos along various dimensions to construct the environment, ensuring that it covers both the perceptual and cognitive components of DVA; select 20 representative machines and 15 human subjects as task executors, so that different model structures allow us to observe the effectiveness of research development; and propose multiple evaluation indicators to quantify their DVA. Based on detailed experimental analyses, we first determine that the current algorithmic research route has effectively narrowed the gap. We further summarize the weaknesses of different executors, and design a human-machine cooperation mechanism with superhuman performance. In summary, our contributions include: (1) quantifying the DVA of humans and machines, (2) proposing a new view for evaluating DVA intelligence based on human-machine comparison, and (3) demonstrating the possibility of human-machine cooperation.
The 87 sequences with frame-level human-machine comparison and cooperation results, the toolbox for recording real-time human performance, code supporting various evaluation metrics, and evaluation reports for the 20 representative models will be open-sourced to help researchers conduct intelligence-oriented research on dynamic vision tasks.

1. INTRODUCTION

Research on visual abilities dates back to the last century (Hubel & Wiesel (1959; 1962)). Neuroscientists divide the human visual system into two categories: the static vision ability (SVA), which perceives the details of static objects (Chan & Courtney (1996)), and the dynamic vision ability (DVA), which tracks moving objects (JW et al. (1962)). These two visual abilities are essential in our daily life (Land & McLeod (2000); Beals et al. (1971); Burg (1966); Kohl et al. (1991)), and have been modeled by a series of computer vision tasks. Recently, with the growth of dataset scale and the abundance of computing resources, most data-driven algorithms have achieved ever higher scores in experimental environments and are widely employed in various scenarios (Dankert et al. (2009); Weissbrod et al. (2013); Wei & Kording (2018)). However, some bad cases hidden beneath this prosperous development challenge the state-of-the-art (SOTA) algorithms. For example, visual models usually degrade in perception when encountering unknown-category targets under special illumination conditions (e.g., a vision-based autonomous vehicle crashing into a large truck at night). This shortcoming is far from humans' powerful visual abilities and may cause safety hazards, prompting us to rethink: with the support of massive datasets and powerful computing resources, why can't SOTA models achieve visual abilities similar to humans'? Do existing evaluation methods actually measure the visual intelligence of machines? Current evaluation mechanisms simply rank machines against one another on benchmark datasets, causing the goal to become downward (i.e., a method that exceeds all others is considered excellent). In fact, these mechanisms measure a machine's performance rather than its intelligence.
When we refer to machine intelligence, a natural association is the Turing test (Turing (2009)), which indicates that machine intelligence evaluation requires human participation, and has gradually inspired scholars to propose a series of essential works on decision-making problems (e.g., AlphaGo (Silver et al. (2017)) and DeepStack (Brown & Sandholm (2018))). Using humans as a reference, the goal becomes upward: exemplary machines must gradually move closer to human capabilities. In other words, exploring the gap between humans and machines is crucial for machine intelligence evaluation. Based on this idea, recent works have introduced humans into several vision tasks, using human baselines to measure machine intelligence. Some researchers endeavor to analyze the gap between machines and humans in image classification (Geirhos et al. (2018; 2020; 2021)), while others investigate the attention areas in static images to explore visual selectivity (Langlois et al. (2021)). Regrettably, their research scope is mainly restricted to static vision tasks, while neuroscience and cognitive psychology studies have shown that the correlation between SVA and DVA is naturally low (Long & Penn (1987)). On the other hand, existing research on DVA evaluation for humans and machines is entirely separate. As shown in Figure 1(a), neuroscientists use toy environments to measure human DVA (Pylyshyn & Storm (1988)), while computer scientists evaluate machines through large-scale datasets (Fan et al. (2020); Huang et al. (2021)): neither the environment nor the evaluation mechanism is compatible, making a direct human-machine DVA comparison impossible. Therefore, an intuitive question is: how can we compare humans and machines in dynamic vision ability? To answer this question, we should first select a suitable computer vision task to represent human DVA, then design an evaluation framework to accomplish human-machine DVA measurement and comparison.
Compared with other tasks like multi-object tracking (MOT) (Ciaparrone et al. (2019)) and video object detection (VOD) (Wang et al. (2018)), single object tracking (SOT, i.e., locating a user-specified moving target in a video) (Wu et al. (2015)) is a category-independent task with no constraints on motion continuity, scene change, or object category, and can be regarded as the task closest to human DVA (Appendix A.1). Due to the human-like task definition of SOT, excellent task executors should not only keep tracking the moving target (perception) but also re-locate the target when its position changes abruptly (cognition) (Appendix A.2). Based on the above analyses, we select SOT as the representative task and follow the 3E paradigm (Hu et al. (2022a)) to design an overall evaluation framework (Figure 1). The technical difficulties and our contributions are as follows.

Experimental environment construction (Section 2.1). The first difficulty in environment construction is compatibility. The high-contrast toy environments used in classical neuroscience work are too simple to evaluate DVA accurately. On the other hand, human psychophysical experiments are expensive and time-consuming, so humans cannot be assessed on large-scale datasets like machines. Hence, the second difficulty is representativity. Given this limitation on dataset scale, the environment should not only fully represent the characteristics and difficulties of DVA, but also provide a graded experimental setup for subsequent analyses. As shown in Table 1, to entirely reflect the task characteristics and thoroughly compare human-machine DVA, we choose 87 videos
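As a concrete illustration of how SOT performance can be quantified frame by frame for both humans and machines, the sketch below computes box overlaps and a per-sequence success rate. The box format (x, y, w, h), the function names, and the 0.5 threshold are illustrative assumptions following common tracking-benchmark conventions, not the exact indicators defined in this work.

```python
# Minimal sketch of frame-level SOT evaluation (illustrative, not this
# paper's exact indicators): compare predicted boxes against ground truth
# using Intersection-over-Union (IoU), then aggregate over a sequence.
def iou(box_a, box_b):
    """IoU of two axis-aligned (x, y, w, h) boxes."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    ix = max(0.0, min(xa + wa, xb + wb) - max(xa, xb))
    iy = max(0.0, min(ya + ha, yb + hb) - max(ya, yb))
    inter = ix * iy
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of frames where the executor's box overlaps the ground
    truth by at least `threshold` IoU -- one simple per-sequence score
    that applies identically to human and machine executors."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(o >= threshold for o in overlaps) / len(overlaps)
```

Because the score depends only on per-frame boxes, the same routine can evaluate a DNN tracker's output and a human subject's recorded cursor boxes on the same sequence, which is the precondition for any direct human-machine comparison.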



Figure 1: Comparison of evaluation frameworks for dynamic vision ability. (a) Existing research on DVA evaluation for humans and machines is entirely separate. (b) We follow the 3E paradigm (Hu et al. (2022a)) to design an overall evaluation framework of human-machine DVA.

