NEARING OR SURPASSING: OVERALL EVALUATION OF HUMAN-MACHINE DYNAMIC VISION ABILITY Anonymous

Abstract

Dynamic vision ability (DVA), a fundamental function of the human visual system, has been successfully modeled by many computer vision tasks in recent decades. However, these prosperous developments have concentrated on using deep neural networks (DNNs) to simulate the human DVA system, while evaluation frameworks still simply compare performance among machines, making it difficult to determine how large the gap between humans and machines is in dynamic vision tasks. Neglecting this issue not only makes it hard to judge the correctness of current research routes, but also prevents a true measurement of the DVA intelligence of machines. To answer this question, this work designs a comprehensive evaluation framework based on the 3E paradigm: we carefully pick 87 videos along various dimensions to construct the environment, confirming that it covers both the perceptual and cognitive components of DVA; select 20 representative machines and 15 human subjects as the task executors, ensuring that the variety of model structures lets us observe the effectiveness of research development; and propose multiple evaluation indicators to quantify their DVA. Based on detailed experimental analyses, we first establish that the current algorithmic research route has effectively narrowed the gap. We further summarize the weaknesses of the different executors, and design a human-machine cooperation mechanism with superhuman performance. In summary, the contributions include: (1) quantifying the DVA of humans and machines, (2) proposing a new view for evaluating DVA intelligence based on human-machine comparison, and (3) demonstrating the possibility of human-machine cooperation.
The 87 sequences with frame-level human-machine comparison and cooperation results, the toolbox for recording real-time human performance, the code supporting the various evaluation metrics, and the evaluation reports for the 20 representative models will be open-sourced to help researchers advance intelligence research on dynamic vision tasks.

1. INTRODUCTION

Research on visual abilities dates back to the last century (Hubel & Wiesel (1959; 1962)). Neuroscientists divide the human visual system into two categories: the static vision ability (SVA) to perceive the details of static objects (Chan & Courtney (1996)), and the dynamic vision ability (DVA) to track moving objects (JW et al. (1962)). These two visual abilities are essential in daily life (Land & McLeod (2000); Beals et al. (1971); Burg (1966); Kohl et al. (1991)), and have been modeled by a series of computer vision tasks. Recently, with the growth of dataset scale and the abundance of computing resources, most data-driven algorithms have achieved higher and higher scores in experimental environments and are widely employed in various scenarios (Dankert et al. (2009); Weissbrod et al. (2013); Wei & Kording (2018)).

However, some bad cases hidden beneath this prosperous development challenge the state-of-the-art (SOTA) algorithms. For example, visual models often lose perception when encountering unknown-category targets under special illumination conditions (e.g., a vision-based autonomous vehicle crashing into a large truck at night). This shortcoming falls far below humans' powerful visual abilities and may cause safety hazards, prompting us to rethink: with the support of massive datasets and powerful computing resources, why can't SOTA models achieve visual ability similar to humans'? Do existing evaluation methods actually measure the visual intelligence of machines?

