HUMANLY CERTIFYING SUPERHUMAN CLASSIFIERS

Abstract

This paper addresses a key question in current machine learning research: if we believe that a model's predictions might be better than those given by human experts, how can we (humans) verify these beliefs? In some cases, this "superhuman" performance is readily demonstrated; for example, by defeating top-tier human players in traditional two-player games. On the other hand, it can be challenging to evaluate classification models that potentially surpass human performance. Indeed, human annotations are often treated as a ground truth, which implicitly assumes the superiority of the human over any models trained on human annotations. In reality, human annotators are subjective and can make mistakes. Evaluating performance with respect to a genuine oracle is more objective and reliable, even though querying the oracle may be expensive or even impossible. In this paper, we first raise the challenge of evaluating the performance of both humans and models with respect to an unobserved oracle. We develop a theory for estimating accuracy with respect to the oracle, using only imperfect human annotations for reference. Our analysis provides an executable recipe for detecting and certifying superhuman performance in this setting, which we believe will assist in understanding the state of current research on classification. We validate the convergence of the bounds and the assumptions of our theory on carefully designed toy experiments with known oracles. Moreover, we demonstrate the utility of our theory by meta-analyzing large-scale natural language processing tasks, for which an oracle does not exist, and show that under our mild assumptions a number of models from recent years have already achieved superhuman performance with high probability. This suggests that our new oracle-based performance evaluation metrics are overdue as an alternative to the widely used accuracy metrics that naively treat imperfect human annotations as ground truth.

1. INTRODUCTION

Artificial Intelligence (AI) agents have begun to outperform humans on remarkably challenging tasks: AlphaGo defeated top-ranked Go players (Silver et al., 2016; Singh et al., 2017), and OpenAI's Dota 2 AI defeated human world champions of the game (Berner et al., 2019). These AI tasks can be evaluated objectively, e.g., by the total score achieved in a game or victory against another player. However, for supervised learning tasks such as image classification and sentiment analysis, certifying a machine learning model as superhuman is subjectively tied to human judgments rather than a comparison with an oracle. We focus on paving a way towards evaluating models with potentially superhuman performance in classification.

When evaluating the performance of a classification model, we generally rely on the accuracy of the predicted labels with respect to ground truth labels, which we call the oracle accuracy. However, oracle labels may arguably be unobservable. For tasks such as object detection and saliency detection, predictions are subject to many factors of the annotators, e.g., their background and physical or mental state. For other tasks, such as predicting molecule toxicity and stability, even experts may not be able to articulate an explicit prediction rule. Without observing oracle labels, researchers often resort to two heuristics: i) human predictions or aggregated human annotations are effectively treated as ground truth (Wang et al., 2018; Lin et al., 2014; Wang et al., 2019) to approximate the oracle, and ii) the inter-annotator agreement is taken as the best possible machine learning model performance (for an extensive survey of works that make this claim without proof, see the works cited within Boguslav & Cohen (2017) and Richie et al. (2022)). These heuristics suffer some key disadvantages. Firstly, the quality control of human annotation is challenging (Artstein, 2017; Lampert et al., 2016).
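The gap between oracle accuracy and human-referenced accuracy can be made concrete with a small simulation. The sketch below is purely illustrative (the error rates, sample size, and helper names are hypothetical assumptions, not taken from the paper): it draws a known binary oracle, a noisy human annotator, and a less noisy model, then compares the model's oracle accuracy with the conventional metric that scores the model against the human labels.

```python
import random

random.seed(0)

N = 10_000  # number of examples in the toy task
ORACLE = [random.randint(0, 1) for _ in range(N)]  # unobserved ground truth

def noisy_copy(labels, error_rate):
    """Simulate an imperfect labeler: flip each binary label with probability `error_rate`."""
    return [1 - y if random.random() < error_rate else y for y in labels]

human = noisy_copy(ORACLE, 0.15)  # annotator wrong on ~15% of examples
model = noisy_copy(ORACLE, 0.05)  # model wrong on ~5% of examples

def accuracy(pred, ref):
    """Fraction of examples where `pred` agrees with the reference labels."""
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)

# Oracle accuracy: the quantity we actually care about (~0.95 for the model,
# ~0.85 for the human, so the model is "superhuman" here by construction).
print(accuracy(model, ORACLE), accuracy(human, ORACLE))
# Human-referenced accuracy: the conventional metric. The model agrees with
# the human on only ~81% of examples, so it *looks* worse than the human,
# who trivially scores 1.0 against their own annotations.
print(accuracy(model, human))
```

Under independent errors, the model and human agree only when both are right or both are wrong, so the human-referenced score is bounded near 0.95 × 0.85 + 0.05 × 0.15 ≈ 0.82 even though the model's oracle accuracy is 0.95; this is exactly why evaluating against imperfect annotations understates a superhuman model.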
Secondly, current evaluation paradigms focus on evaluating the performance of models, but not the oracle accuracy of humans; yet we cannot claim that a machine learning model is superhuman without properly estimating human performance relative to the oracle. Thirdly, as machine learning models exceed human performance on important tasks, it becomes insufficient to merely report the agreement of the model with human annotations.

[Figure: annotators (i) and (j) with their probabilities of agreement with the oracle, $P(\ell_i = \ell_\star)$ and $P(\ell_j = \ell_\star)$.]
> < l a t e x i t s h a 1 _ b a s e 6 4 = " J e v 8 x / Q e j F q i q 7 D H L Z E S I o j O y i 4 = " > A A A C B 3 i c d V D L S s N A F J 3 U V 6 2 v q E t B B o t Q N y G p p d W F U H T j s o J 9 Q B P C Z D p t x 0 4 e z E y E E r p z 4 6 + 4 c a G I W 3 / B n X / j J I 2 g o g e G O Z x z L / f e 4 0 W M C m m a H 1 p h Y X F p e a W 4 W l p b 3 9 j c 0 r d 3 O i K M O S Z t H L K Q 9 z w k C K M B a U s q G e l F n C D f Y 6 T r T S 5 S v 3 t L u K B h c C 2 n E X F 8 N A r o k G I k l e T q + 7 a P 5 N j z Y K t i E 8 b c m 7 P s S 2 w h E Z 8 d u X r Z N G r V a v 3 Y h H N y 2 s i J W Y e W Y W Y o g x w t V 3 + 3 B y G O f R J I z J A Q f c u M p J M g L i l m Z F a y Y 0 E i h C d o R P q K B s g n w k m y O 2 b w U C k D O A y 5 e o G E m f q 9 I 0 G + E F P f U 5 X p 1 u K 3 l 4 p / e f 1 Y D k + c h A Z R L E m A 5 4 O G M Y M y h G k o c E A 5 w Z J N F U G Y U 7 U r x G P E E Z Y q u p I K 4 e t S + D / p V A 3 L N K y r W r l 5 n s d R B H v g A F S A B R q g C S 5 B C 7 Q B B n f g A T y B Z + 1 e e 9 R e t N d 5 a U H L e 3 b B D 2 h v n 9 C w m U o = < / l a t e x i t > Figure 1 : The relationship between a) the oracle accuracy of the annotators, P(ℓ i = ℓ ⋆ ), and b) the agreement between two annotators, P(ℓ i = ℓ j ). ℓ i and ℓ j are labels given by annotator i and j, ℓ ⋆ is the oracle label. In our setting, part a) is unobserved (gray) and part b) is observed (black). In this paper, we work on the setting that oracle labels are unobserved (see Figure 1 ). Within this setting is provided a theory for estimating the oracle accuracy on classification tasks which formalises what empirical works have hinted towards (Richie et al., 2022) , that machine learning classification models may outperform the humans who provide them with training supervision. 
Our aim is not to optimally combine machine learning systems, but rather to estimate the oracle accuracy of a single machine learning system by comparing it with the results obtained from multiple human annotators. Our theory includes i) upper bounds on the average oracle accuracy of the annotators, ii) lower bounds on the oracle accuracy of the model, and iii) finite-sample analysis of both bounds and of their margin, which captures the degree to which the model outperforms. Based on our theory, we propose an algorithm to detect competitive models and to report confidence scores, which formally bound the probability that a given model outperforms the average human annotator. Empirically, we observe that some existing models for sentiment classification and natural language inference (NLI) have already achieved superhuman performance with high probability.

2. EVALUATION THEORY

We now present our theory for human annotators and machine learning models with oracle labels.

2.1. PROBLEM STATEMENT

We are given K labels crowdsourced from K human annotators, {ℓ_i}_{i=1}^K, and some labels from a model, ℓ_M. The probability that two annotators i and j give matching annotations is P(ℓ_i = ℓ_j). Denote by ℓ_K the label of the "average" human annotator, which we define as the label obtained by selecting one of the K human annotators uniformly at random. We seek to formally compare the oracle accuracy of the average human, P(ℓ_K = ℓ⋆), with that of the machine learning model, P(ℓ_M = ℓ⋆), where ℓ⋆ is the unobserved oracle label. Denote by ℓ_G the label obtained by aggregating (say, by voting) the K human annotators' labels. We distinguish between the oracle accuracy P(ℓ_M = ℓ⋆) and the agreement with human annotations P(ℓ_M = ℓ_G), although these two concepts have been conflated in many previous applications and benchmarks.

Connection to the traditional practice of accuracy calculation. Generally, the ground truth of a benchmark corpus is constructed by aggregating multiple human annotations (Wang et al., 2018; 2019). For example, the averaged sentiment score is used in SST (Socher et al., 2013) and the majority vote in SNLI (Bowman et al., 2015). The aggregated annotations are then treated as ground truth to calculate accuracy. Under this setting, the 'traditional' accuracy score evaluated on the (aggregated) human ground truth can be viewed as a special case of our lower bound.
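To make the two label constructions concrete, the "average annotator" label ℓ_K and the aggregated label ℓ_G can be sketched as follows. This is a minimal NumPy illustration of the definitions above; the function names are ours, not from the paper's released code:

```python
import numpy as np

def average_annotator_label(labels, rng):
    """l_K: for each sample, the label of one of the K annotators
    chosen uniformly at random."""
    N, K = labels.shape
    return labels[np.arange(N), rng.integers(0, K, size=N)]

def aggregated_label(labels):
    """l_G: majority vote over the K annotators, per sample
    (ties broken towards the smallest class index)."""
    return np.array([np.bincount(row).argmax() for row in labels])

# Toy usage: 4 samples, 3 annotators, labels in {0, 1}.
labels = np.array([[0, 0, 1],
                   [1, 1, 1],
                   [0, 1, 1],
                   [0, 0, 0]])
print(aggregated_label(labels))  # -> [0 1 1 0]
```

The oracle accuracy of ℓ_K is then the average of the individual annotators' oracle accuracies, while ℓ_G is the reference we later use when lower-bounding the model's oracle accuracy.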

2.4. FINITE SAMPLE ANALYSIS

The results above assume that the agreement probabilities are known; we now connect them to the finite-sample case. Let ℓ^(n) denote the label assigned to the n-th data point according to ℓ, for n = 1, 2, ..., N; P^(N) is the empirical probability given N observations, and P = lim_{N→∞} P^(N). We begin with a standard concentration inequality (see e.g. Boucheron et al., 2013, §2.6).

Theorem 4 (Hoeffding's Inequality) Let X_1, ..., X_N be independent random variables with finite variance such that P(X_n ∈ [α, β]) = 1 for all 1 ≤ n ≤ N. Let X̄ ≜ (1/N) Σ_{n=1}^N X_n. Then, for any t > 0,

P(X̄ − E[X̄] ≥ +t) ≤ exp(−2Nt² / (β − α)²),
P(X̄ − E[X̄] ≤ −t) ≤ exp(−2Nt² / (β − α)²).   (8)

Combining this with Theorem 1 we obtain the following.

Theorem 5 (Sample Average Performance Upper Bound) Adopt Theorem 1's assumptions, let

P^(N)(ℓ_i = ℓ_j) = (1/N) Σ_{n=1}^N [ℓ_i^(n) = ℓ_j^(n)]   (9)

define the empirical agreement ratio,foot_0 and let δ_u = exp(−2Nt_u²). Then, for any t_u > 0, with probability at least 1 − δ_u,

P(ℓ_K = ℓ⋆) ≤ sqrt( t_u + (1/K²) Σ_{i=1}^K Σ_{j=1}^K P^(N)(ℓ_i = ℓ_j) ).

Theorem 6 (Sample Performance Lower Bound) Adopt Theorem 3's assumptions and equation 9, and define

δ_l = exp(−2Nt_l²).

Then, for any t_l > 0, with probability at least 1 − δ_l,

P^(N)(ℓ_a = ℓ_b) ≤ t_l + P(ℓ_b = ℓ⋆).
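As a concrete sketch, the finite-sample statistics of Theorems 5 and 6 can be computed directly from an N × K matrix of annotator labels. This is our own illustrative code under the stated assumptions, not the authors' implementation:

```python
import numpy as np

def sample_upper_bound(labels, t_u):
    """Theorem 5: sqrt(t_u + (1/K^2) * sum_{i,j} P^(N)(l_i = l_j)),
    an upper bound on P(l_K = l_*) w.p. >= 1 - exp(-2*N*t_u^2).
    `labels` has shape (N samples, K annotators)."""
    N, K = labels.shape
    agreement = sum(np.mean(labels[:, i] == labels[:, j])
                    for i in range(K) for j in range(K))
    return np.sqrt(t_u + agreement / K**2)

def sample_lower_bound(model_labels, reference_labels, t_l):
    """Theorem 6: L_N - t_l, a lower bound on P(l_M = l_*)
    w.p. >= 1 - exp(-2*N*t_l^2), using aggregated labels as reference."""
    return np.mean(model_labels == reference_labels) - t_l

# Toy check: three annotators in perfect agreement give U_N = sqrt(t_u + 1).
labels = np.tile(np.array([[0], [1], [0], [1]]), (1, 3))
assert np.isclose(sample_upper_bound(labels, 0.0), 1.0)
```

Note that the double sum deliberately includes the self-agreement terms i = j, matching the theoretical bound; the empirically approximated variant discussed in Section 2.2 drops them.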

2.5. DETECTING AND CERTIFYING SUPERHUMAN MODELS

We propose a procedure to discover potentially superhuman models based on our theorems:
1. Calculate the upper bound U_N on the average oracle accuracy of the human annotators, using N data samples;
2. Calculate the lower bound L_N on the model's oracle accuracy, using the aggregated human annotations as the reference,foot_1 with N data samples;
3. Check whether the finite-sample margin between the bounds, L_N − U_N, is larger than zero;foot_2
4. Choose suitable values of t_u and t_l and calculate a confidence score for P(L − U ≥ 0).

Generally, a larger margin indicates higher confidence of out-performance. To formally quantify the confidence attached to this margin, we provide the following theorem and corresponding algorithms.

Theorem 7 (Confidence of Out-Performance) Assume an annotator pool with agreement statistic U_N of equation 34, and an agreement statistic L_N between the model and the aggregated annotations of equation 39. If L_N > U_N, then for all τ ≥ 0, t_u ≥ 0 and t_l ≥ 0 that satisfy

L_N − t_l − sqrt(t_u + U_N²) = τ,   (14)

with probability at least 1 − δ_u − δ_l the oracle accuracy of the model exceeds that of the average annotator by τ,

P( P(ℓ_M = ℓ⋆) − P(ℓ_K = ℓ⋆) ≥ τ ) ≥ 1 − δ_l − δ_u,

where δ_u = exp(−2Nt_u²) and δ_l = exp(−2Nt_l²).

Confidence Score Estimation. The above theorem suggests the confidence score S = 1 − δ_l − δ_u, and we need only choose the free constants t_l, t_u and τ. Recall equation 14, τ = (L_N − t_l) − sqrt(t_u + U_N²); we remove one degree of freedom by parameterising t_l in terms of t_u,

t_l(t_u, τ) = L_N − τ − sqrt(t_u + U_N²).

We are interested in P(L − U ≥ 0), so we choose τ = 0, and give two choices for t_u and t_l.

Algorithm 1 (Heuristic Margin Separation, HMS). We assign half of the margin to t_u,

t_u = (L_N − U_N) / 2.

Then, with τ = 0, we calculate the corresponding

t_l = L_N − sqrt( (L_N − U_N)/2 + U_N² ),

and compute the heuristic confidence score S.

Algorithm 2 (Optimal Margin Separation, OMS).
For a locally (in t_u) optimal score, we perform gradient ascent (Lemaréchal, 2012) on S(t_u), where S(t_u) = 1 − δ_u(t_u) − δ_l(t_l(t_u, 0)), with t_u initialized to (L_N − U_N)/2 before optimization.foot_3
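The two algorithms can be sketched directly from the formulas above. This is a minimal sketch, assuming L_N > U_N; the finite-difference gradient stands in for an analytic derivative, and all names are ours:

```python
import math

def delta(t, N):
    """Hoeffding tail bound exp(-2*N*t^2)."""
    return math.exp(-2.0 * N * t * t)

def t_l_of(t_u, U_N, L_N, tau=0.0):
    """Solve L_N - t_l - sqrt(t_u + U_N^2) = tau for t_l (equation 14)."""
    return L_N - tau - math.sqrt(t_u + U_N ** 2)

def hms_score(U_N, L_N, N):
    """Heuristic Margin Separation: assign half of L_N - U_N to t_u."""
    t_u = (L_N - U_N) / 2.0
    t_l = t_l_of(t_u, U_N, L_N, tau=0.0)
    return 1.0 - delta(t_u, N) - delta(t_l, N)

def oms_score(U_N, L_N, N, lr=1e-4, steps=100, eps=1e-6):
    """Optimal Margin Separation: gradient ascent on S(t_u),
    starting from the HMS choice t_u = (L_N - U_N)/2."""
    S = lambda t: 1.0 - delta(t, N) - delta(t_l_of(t, U_N, L_N), N)
    t_u = (L_N - U_N) / 2.0
    for _ in range(steps):
        grad = (S(t_u + eps) - S(t_u - eps)) / (2.0 * eps)
        t_u = max(t_u + lr * grad, 0.0)  # keep t_u >= 0
    return S(t_u)
```

For example, with illustrative values U_N = 0.8, L_N = 0.9 and N = 2000, both scores are close to 1, and OMS is at least as large as HMS since it starts from the HMS point and ascends.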

3. EXPERIMENTS AND DISCUSSION

Previously, we introduced a new theory for analyzing the oracle accuracy of a set of classifiers using the observed agreements between them. In this section, we apply our theory to several classification tasks to demonstrate its utility and the reliability of the associated assumptions. Our code is available at https://github.com/xuqiongkai/Superhuman-Eval.git.

3.1. EXPERIMENTAL SETUP

We first consider two classification tasks with oracle labels generated by rules. Given the oracle predictions, we are able to empirically validate the assumptions of our theorems and observe the convergence of the bounds. We then apply our theory to two real-world classification tasks and demonstrate that some existing state-of-the-art models have potentially achieved better performance than the (average) performance of the human annotators with respect to the (unobserved) oracle.

Classification tasks with oracle rules. To validate the correctness of our theory, we collect datasets with observable oracle labels. We construct two visual cognitive tasks, Color Classification and Shape Classification, with explicit unambiguous rules for acquiring oracle labels:
• Color Classification: select the most frequently occurring color of the objects in an image.
• Shape Classification: select the most frequently occurring shape of the objects in an image.
For both tasks, object size is ignored. As illustrated in Figure 2, we vary colors (Red, Blue and Yellow) and shapes (Triangle, Square, Pentagon, Hexagon and Circle) for the two tasks, respectively. For each task, we generated 100 images and recruited 10 annotators from Amazon Mechanical Turkfoot_4 to label them. Each randomly generated example includes 20 to 40 objects. We enforce that no object overlaps more than 70% with any other, and that there is only one class with the highest count, to ensure uniqueness of the oracle label. The true counts of colors and shapes are recorded to generate the oracle labels of the examples. Note that our U² is the average agreement among annotators, and so is proportional to the Cohen's Kappa coefficient, which we report in Appendix D along with additional details about the guidelines and interface presented to the annotators.

Real-World Classification Tasks.
We analyze the performance of human annotators and machine learning models on two real-world NLP tasks, namely sentiment classification and natural language inference (NLI). We use the Stanford Sentiment Treebank (SST) (Socher et al., 2013) for sentiment classification. The sentiment labels are mapped into two classes (SST-2)foot_5 or five classes (SST-5): very negative ([0, 0.2]), negative ((0.2, 0.4]), neutral ((0.4, 0.6]), positive ((0.6, 0.8]), and very positive ((0.8, 1.0]). We use the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) for NLI. All samples are classified by five annotators into three categories, i.e. Contradiction (C), Entailment (E), and Neutral (N). More details of the datasets are reported in Appendix C. In the remainder of this section, we focus on the upper bounds estimated on test sets, as we intend to compare them with the performance of machine learning models, which is generally evaluated on test sets.

Machine Learning Models. For the two classification tasks with known oracles, we treat classification as a detection problem and train YOLOv3 models (Redmon & Farhadi, 2018). The input image resolution is 608 × 608, and we use the proposed Darknet-53 as the backbone feature extractor. For comparison, we train two models, a strong model and a weak model, on 512 and 128 randomly generated examples, respectively. All models are trained for a maximum of 200 epochs, until convergence. During inference, the model detects the objects and we count each type of object to obtain the prediction. For the real-world classification tasks, we compare several representative models and their variants, such as Recurrent Neural Networks (Chen et al., 2018; Zhou et al., 2015), Tree-based Neural Networks (Mou et al., 2016; Tai et al., 2015), and Pre-trained Transformers (Devlin et al., 2019; Radford et al., 2018; Wang et al., 2020; Sun et al., 2020).

3.2. RESULTS AND DISCUSSION

We now conduct several experiments to validate the convergence of the bounds and the validity of the assumptions. We then demonstrate the utility of our theory by detecting real-world superhuman models. We organize the discussion into several research questions (RQ).

RQ1: Will the bounds converge given more annotators? We first analyze the lower bounds. We show the lower bounds for the strong (s) and weak (w) models in Figure 3, in black and blue lines respectively. Generally, i) the lower bounds L_N always lie below the corresponding oracle accuracy; ii) the lower bounds grow, and tend to approach the bounded scores, as more annotators are aggregated. We then analyze the upper bounds. We illustrate the theoretical upper bound U^(t)_N and the empirically approximated upper bound U^(e)_N, in comparison with the average oracle accuracy of the annotators P(ℓ_K = ℓ⋆), in Figure 3. We observe that i) both upper bounds give higher estimates than the average oracle accuracy of the annotators; ii) the margin between the two upper bounds shrinks as the number of annotators grows.

RQ2: Are the assumptions of our theorems valid? We verify the key assumptions for the upper bound of Theorem 1 and the lower bound of Theorem 3 by computing the relevant quantities in Table 1. The assumptions within these theorems are not concerned with agreement (or otherwise) on particular training examples (which could be unrealistic), but rather are statements that hold on average over the data distribution.

BROADER IMPACT

Our approach can identify classification models that outperform typical humans in terms of classification accuracy. Such conclusions influence the understanding of the current stage of research on classification, and therefore potentially impact the strategies and policies of human-computer collaboration and interaction. The questions we may help to answer include the following: When should we prefer a model's diagnosis over that of a medical professional? In courts of law, should we leave sentencing to an algorithm rather than a judge?
These questions and many more like them are too important to ignore. Given recent progress in machine learning, we believe such work is overdue.

LIMITATIONS

We caution, however, that estimating a model's oracle accuracy in this way is not free. Our approach requires results from multiple annotators, and preferably the number of annotators should exceed the number of possible classes in the target classification task. Another potential challenge in applying our analysis is that some of our assumptions may not hold under specific tasks or settings, e.g., a collusion attack by a group of annotators. We recommend that those who apply our theory collect, where possible, a small amount of 'oracle' annotations to validate the assumptions in this paper. Our work focuses on multi-class classification, which admits only a single answer for each task. A multi-label classification task can be transformed into multiple binary classification tasks before applying our theorems.



Footnotes:
foot_0: Here [·] is the Iverson bracket.
foot_1: We demonstrate that aggregating the predictions by voting and weighted averaging is effective in improving our bounds. We emphasize, however, that the aggregated predictions need not be perfect, as we do not assume that this aggregation yields an oracle.
foot_2: A larger deviation, say a high positive value, is of more interest to our certification, as it gives a higher confidence score to the outperformance.
foot_3: We set the learning rate to 1e-4 and iterated 100 times.
foot_4: https://www.mturk.com
foot_5: Samples with overall neutral scores are excluded, as in (Tai et al., 2015).
foot_6: Binary classification is discussed for simplicity.



Figure 2: Examples of a) Color Classification and b) Shape Classification. a) includes 40 objects of three colors, Red (14), Blue (15) and Yellow (11), with Blue as the most frequent color and therefore the oracle label. b) includes 37 objects of five different shapes, Triangle (9), Square (10), Pentagon (7), Hexagon (6) and Circle (5), with Square the dominant shape and oracle label.

Figure 3: Comparison of the sample lower bound L_N with the model oracle accuracy P(ℓ_M = ℓ⋆). Relatively strong and weak models are indicated by M(s) and M(w). The aggregation of one annotator is based on the labels provided by that single annotator. Also shown: comparison of the sample theoretical upper bound U^(t)_N and the sample empirical upper bound U^(e)_N with the average oracle accuracy of the annotators, P(ℓ_K = ℓ⋆).

We use U^(e)_N as U_N to calculate the confidence scores in the later discussion.


* Work was done while the authors were with the Australian National University and Data61 CSIRO.

2.2. AN UPPER BOUND FOR THE AVERAGE ANNOTATOR PERFORMANCE

The oracle accuracy of the average annotator ℓ_K follows the definition of the previous section, and conveniently equals the average of the oracle accuracies of the individual annotators, i.e.

P(ℓ_K = ℓ⋆) = (1/K) Σ_{i=1}^K P(ℓ_i = ℓ⋆).   (1)

By introducing an assumption, equation 2, we may bound the above quantity. Intuitively, annotators are likely to be positively correlated because i) they tend to succeed or fail together on the same easy or difficult examples, and ii) they may share similar backgrounds that affect their decisions. Note that this assumption is also discussed in Section 3.2 (see RQ2), where we provide supporting evidence for it on a real-world problem with known oracle labels.

Theorem 1 (Average Performance Upper Bound) Assume annotators are positively correlated,

P(ℓ_i = ℓ⋆ | ℓ_j = ℓ⋆) ≥ P(ℓ_i = ℓ⋆).   (2)

Then, the averaged annotator accuracy with respect to the oracle is bounded above by

P(ℓ_K = ℓ⋆) ≤ U ≜ sqrt( (1/K²) Σ_{i=1}^K Σ_{j=1}^K P(ℓ_i = ℓ_j) ).   (3)

We observe that the average inter-annotator agreement is over-estimated by the inclusion of the self-comparison terms P(ℓ_i = ℓ_j) with i = j, which always equal one. However, the total over-estimation of U² is at most 1/K (K out of the K² terms), and this influence shrinks to zero in the limit K → ∞. To provide a more practical estimate, we introduce an empirically approximated upper bound U^(e); by contrast, U in equation 3 is also referred to as the theoretical upper bound, U^(t).

Definition 1 The empirically approximated upper bound is

U^(e) ≜ sqrt( (1/(K(K−1))) Σ_{i≠j} P(ℓ_i = ℓ_j) ).   (4)

Lemma 2 (Convergence of U^(e)) Assume that Σ_{j=1, j≠i}^K P(ℓ_i = ℓ_j) ≥ (K−1)/N_c, where N_c is the constant number of classes. Then the approximated upper bound U^(e) satisfies

lim_{K→+∞} U / U^(e) = 1.   (5)

Therefore, with large K, U^(e) converges to U (i.e. U^(t)). Empirical support for the convergence of U^(e) to U^(t) is shown in Figure 3 of Section 3.2.
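The relation between the two upper bounds can be illustrated numerically. The sketch below (our own code, using a stylised annotator pool in which every distinct pair agrees with a fixed probability p) computes U^(t) and U^(e) from a K × K agreement matrix and shows the ratio approaching 1 as K grows, as Lemma 2 predicts:

```python
import numpy as np

def theoretical_U(A):
    """U^(t) of equation 3: sqrt of the mean of the full K x K
    agreement matrix A, including the self-agreement diagonal."""
    K = A.shape[0]
    return np.sqrt(A.sum() / K**2)

def empirical_U(A):
    """U^(e) of equation 4: sqrt of the mean off-diagonal agreement,
    excluding the i = j terms."""
    K = A.shape[0]
    return np.sqrt((A.sum() - np.trace(A)) / (K * (K - 1)))

def agreement_matrix(K, p=0.7):
    """Stylised pool: every distinct pair of annotators agrees w.p. p."""
    return np.full((K, K), p) + (1 - p) * np.eye(K)

# The ratio U / U^(e) approaches 1 as the pool grows (Lemma 2).
for K in (5, 20, 100):
    A = agreement_matrix(K)
    print(K, theoretical_U(A) / empirical_U(A))
```

With p = 0.7 the ratio decreases monotonically towards 1 as K increases, matching the 1/K over-estimation argument above.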

2.3. A LOWER BOUND FOR MODEL PERFORMANCE

For our next result, we introduce a second assumption, equation 6. Given two predicted labels ℓ_a and ℓ_b, we assume that ℓ_b is reasonably predictive even on those instances that ℓ_a gets wrong, as formally stated within the following theorem. Note that this assumption is rather mild, in that even random guessing satisfies it: in that case the probability of choosing the correct label equals that of any other single wrong label. Once again, this key assumption is discussed and validated on human data with known oracle labels in Section 3.2 (see RQ2).

Theorem 3 (Performance Lower Bound) Assume that, for any single incorrect label c ≠ ℓ⋆,

P(ℓ_b = ℓ⋆ | ℓ_a = c) ≥ P(ℓ_b = c | ℓ_a = c).   (6)

Then, the oracle accuracy of ℓ_b is bounded below by

P(ℓ_b = ℓ⋆) ≥ P(ℓ_a = ℓ_b).   (7)

In practice, a more accurate ℓ_a gives a tighter lower bound for ℓ_b, and so we employ the aggregated human annotations for the former (letting ℓ_a = ℓ_G) to calculate the lower bound for the machine learning model (letting ℓ_b = ℓ_M), as demonstrated in Section 3.2.

Published as a conference paper at ICLR 2023

Table 1.b (fragment): assumption-check statistics per task and model — Color M(s): 1.000 / 0.000; Shape M(w): 0.579 / 0.421; Shape M(s): 0.895 / 0.105.

Theorem 3 assumes that on average, over all classifier inputs and class labels, if the majority vote of the humans is incorrect w.r.t. the oracle, then the machine learning model is still more likely to predict the oracle label than any other specific label that disagrees with the oracle. The two assumptions clearly hold in our specially designed experiments with real human subjects, although we can only perform this analysis on the tasks with known oracle labels. Moreover, the methodology behind Table 1 is by design rather conservative, as we sum over all incorrect labels (see column 2 of Table 1.b). Despite this stricter setup, our assumption still holds in both experiments.

Disclaimer: while the assumptions appear reasonable, we recommend, where possible, obtaining some oracle labels to validate the assumptions when applying our theory.
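The lower bound and its assumption can also be checked numerically. Below we simulate a reference labeler a (a stand-in for the aggregated annotations) and a model b whose errors are spread uniformly over the wrong classes, so that assumption 6 holds by construction, and confirm P(ℓ_a = ℓ_b) ≤ P(ℓ_b = ℓ⋆). All parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_classes = 100_000, 3
oracle = rng.integers(0, n_classes, size=N)

def noisy_labels(oracle, accuracy, rng):
    """Correct w.p. `accuracy`; otherwise a uniformly random wrong class,
    so even the wrong predictions satisfy assumption 6."""
    wrong_shift = rng.integers(1, n_classes, size=oracle.size)
    wrong = (oracle + wrong_shift) % n_classes
    keep = rng.random(oracle.size) < accuracy
    return np.where(keep, oracle, wrong)

l_a = noisy_labels(oracle, 0.80, rng)  # stand-in for aggregated annotations
l_b = noisy_labels(oracle, 0.90, rng)  # stand-in for the model

agreement = np.mean(l_a == l_b)        # estimates P(l_a = l_b)
accuracy_b = np.mean(l_b == oracle)    # estimates P(l_b = l_*)
assert agreement <= accuracy_b         # Theorem 3's lower bound holds
```

Here the agreement (roughly 0.8 × 0.9 plus a small term for coinciding errors) sits well below the model's oracle accuracy, illustrating why agreement with an imperfect reference is a valid lower bound.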
RQ4: How confident are the certifications? We calculate our confidence scores for the identified outperforming models via U_N, L_N and N, using HMS and OMS, as reported in Table 3. Generally, the confidence scores for SNLI models are higher than those for SST-2, because the former's test set is more than five times larger, while more recent and advanced models achieve higher confidence scores as they have a larger margin L_N − U_N.

4. RELATED WORK

Classification accuracy is a widely used measure of model performance (Han et al., 2011), although there are other options such as precision, recall, F1-score (Chowdhury, 2010; Sasaki et al., 2007), the Matthews correlation coefficient (Matthews, 1975; Chicco & Jurman, 2020), etc. Accuracy measures the disagreement between the model outputs and some reference labels. A common practice is to collect human labels to treat as the reference. However, we argue that the ideal reference is rather the (unobserved) oracle, as human predictions are imperfect. We focus on measuring the oracle accuracy of both human annotators and machine learning models, and on comparing the two.

A widely accepted approach is to crowdsource (Kittur et al., 2008; Mason & Suri, 2012) a dataset for testing purposes. Researchers collect a large corpus with each example labeled by multiple annotators. The aggregated annotations are then treated as ground-truth labels (Socher et al., 2013; Bowman et al., 2015). This largely reduces the variance of the prediction (Nowak & Rüger, 2010; Kruger et al., 2014); however, such aggregated results are still not the oracle, and their deviation from the oracle remains unclear. In our paper, we prove that the accuracy against aggregated human predictions, treated as ground truth, can be considered a special case of the lower bound on the oracle accuracy of machine learning models. On the other hand, much work considers the reliability of collected data by providing agreement scores between annotators (Landis & Koch, 1977). Statistical measures for the reliability of inter-annotator agreement (Gwet, 2010), such as Cohen's Kappa (Pontius Jr & Millones, 2011) and Fleiss' Kappa (Fleiss, 1971), are normally based on the raw agreement ratio. However, the agreement between annotators does not directly reflect oracle accuracy; e.g. identical predictions from two annotators do not mean that either is an oracle.
In our paper, we prove that the observed agreement between all annotators can serve as an upper bound on the average oracle accuracy of those annotators. Overall, we propose a theory for comparing the oracle accuracy of human annotators and machine learning models by connecting the aforementioned bounds.

The discovery that models can predict better than humans dates back at least to the seminal work of Meehl (1954), which compared ad hoc predictions based on subjective information to those based on simple linear models with a (typically small) number of relevant numeric attributes. Subsequent work found that one may even train such a model to mimic the predictions made by the experts (rather than an oracle), and yet still maintain superior out-of-sample performance (Goldberg, 1970). The comparison of human and algorithmic decision making remains an active topic of psychology research (Kahneman et al., 2021). Despite this, much work continues to assume, without formal proof, that the inter-annotator agreement gives an upper bound on the achievable machine learning model performance (Boguslav & Cohen, 2017; Richie et al., 2022); the mounting empirical evidence against this assumption is now placed on a solid theoretical footing by the present work.

5. CONCLUSIONS

In this paper, we built a theory for estimating the oracle accuracy of classifiers. Our theory covers i) upper bounds on the average performance of human annotators, ii) lower bounds on the performance of machine learning models, and iii) confidence scores which formally capture the degree of certainty with which we may assert that a model outperforms the human annotators. Our theory provides formal guarantees even in the highly practically relevant setting of a finite data sample and no access to an oracle to serve as ground truth. Our experiments on synthetic classification tasks validate the plausibility of the assumptions on which our theorems are built. Finally, our meta-analysis of existing progress identified some existing state-of-the-art models that have already achieved superhuman performance relative to the average human annotator.

A PROOF FOR THEOREMS AND LEMMAS

Proof of Theorem 1 (Average Performance Upper Bound)
Proof For i ≠ j and i, j ∈ {1, ..., K}, we have

P(ℓ_i = ℓ_j) ≥ P(ℓ_i = ℓ⋆, ℓ_j = ℓ⋆) = P(ℓ_i = ℓ⋆ | ℓ_j = ℓ⋆) P(ℓ_j = ℓ⋆) ≥ P(ℓ_i = ℓ⋆) P(ℓ_j = ℓ⋆),   (23)

where the last step uses assumption 2. While for i = j, we have

P(ℓ_i = ℓ_j) = 1 ≥ P(ℓ_i = ℓ⋆) P(ℓ_j = ℓ⋆).   (24)

Then, combining equation 23 and equation 24,

U² = (1/K²) Σ_{i=1}^K Σ_{j=1}^K P(ℓ_i = ℓ_j) ≥ ( (1/K) Σ_{i=1}^K P(ℓ_i = ℓ⋆) )² = P(ℓ_K = ℓ⋆)².

Proof of Theorem 3 (Performance Lower Bound)
Proof Decomposing the agreement between ℓ_a and ℓ_b,

P(ℓ_a = ℓ_b) = P(ℓ_a = ℓ_b = ℓ⋆) + Σ_{c ≠ ℓ⋆} P(ℓ_b = c, ℓ_a = c)
            ≤ P(ℓ_a = ℓ_b = ℓ⋆) + Σ_{c ≠ ℓ⋆} P(ℓ_b = ℓ⋆, ℓ_a = c)   (by assumption 6)
            ≤ P(ℓ_b = ℓ⋆).

Proof of Lemma 2 (Convergence of the Empirically Approximated Upper Bound)
Proof By comparing the upper bound and the empirical upper bound, we have

U² / (U^(e))² = [(K−1)/K] · [ (K + Σ_{i≠j} P(ℓ_i = ℓ_j)) / Σ_{i≠j} P(ℓ_i = ℓ_j) ].   (29)

For the first factor in equation 29, (K−1)/K → 1 as K → ∞. For the second factor in equation 29, as both annotators address the same task, the annotator agreement should be better than guessing uniformly at random, i.e. P(ℓ_i = ℓ_j) ≥ 1/N_c, where N_c is the number of categories in the classification task. Then, using the looser constraint Σ_{j≠i} P(ℓ_i = ℓ_j) ≥ (K−1)/N_c, we obtain Σ_{i≠j} P(ℓ_i = ℓ_j) ≥ K(K−1)/N_c, so that the second factor is at most 1 + N_c/(K−1) → 1 as K → ∞. Combining the two factors, U/U^(e) → 1; the empirically approximated upper bound converges to the theoretical upper bound as K grows larger.

Proof of Theorem 5 (Sample Average Performance Upper Bound)
Proof We apply Theorem 4 with

X_n = (1/K²) Σ_{i=1}^K Σ_{j=1}^K [ℓ_i^(n) = ℓ_j^(n)],   (33)

obtaining X_n ∈ [0, 1], i.e. α = 0 and β = 1. Our choice equation 33 of X_n implies

U_N² = X̄ and U² = E[X̄],   (34)

and so by equation 8, with probability at least 1 − δ_u,

U² ≤ U_N² + t_u.   (35)

Rewrite equation 3 as

P(ℓ_K = ℓ⋆) ≤ U = sqrt(U²),   (37)

Combining equation 35 with equation 37 gives the result.

Proof of Theorem 6 (Sample Performance Lower Bound)
Proof We apply Theorem 4 with

X_n = [ℓ_a^(n) = ℓ_b^(n)],   (38)

obtaining X_n ∈ [0, 1], i.e. α = 0 and β = 1. Now equation 38 implies

L_N = X̄ and P(ℓ_a = ℓ_b) = E[X̄],   (39)

and so by equation 8, with probability at least 1 − δ_l,

P(ℓ_a = ℓ_b) ≥ L_N − t_l.   (40)

Recall equation 7,

P(ℓ_a = ℓ_b) ≤ P(ℓ_b = ℓ⋆),   (41)

which implies P(ℓ_b = ℓ⋆) ≥ L_N − t_l with the same probability. Combining equation 40 with equation 41 gives the result.

Proof of Theorem 7 (Confidence of Out-Performance)
Proof Recall Theorem 5 and Theorem 6: with probability at least 1 − δ_u, P(ℓ_K = ℓ⋆) ≤ sqrt(t_u + U_N²), and with probability at least 1 − δ_l, P(ℓ_M = ℓ⋆) ≥ L_N − t_l. By the union bound, both events hold simultaneously with probability at least 1 − δ_u − δ_l, in which case

P(ℓ_M = ℓ⋆) − P(ℓ_K = ℓ⋆) ≥ (L_N − t_l) − sqrt(t_u + U_N²) = τ.

B AN EXAMPLE FOR THE ASSUMPTIONS

Here, we provide a running example to show that both assumptions, for Theorems 1 and 3, can reasonably hold with no conflict. A common example is demonstrated in Table 4. In this case, all annotators did a decent job (generally more correct than incorrect in all conditions). For the more challenging condition (other annotators fail), the ratio of correct performance is slightly lower; see the rows in Table 4a. For assumption 2, all of the corresponding inequalities hold. Note that b should be a decent ML model, or a rational annotator who works better than random guessing, i.e., with accuracy above 0.5.

C DETAILS FOR DATASETS

Table 5: Dataset statistics: test-set size, number of classes, and number of annotators per sample.

Dataset | #Test | #Classes | #Annotators
SST-2 (Socher et al., 2013) | 1,821 | 2 | 3
SST-5 (Socher et al., 2013) | 2,210 | 5 | 3
SNLI (Bowman et al., 2015) | 10,000 | 3 | 5

D DETAILS FOR HUMAN ANNOTATION

We crowdsource the annotations via Amazon Mechanical Turk. The annotation interfaces, with instructions, for color classification and shape classification are illustrated in Figure 4. Each example is annotated by K = 10 different annotators. For quality control, we i) offer our tasks only to experienced annotators with 100 or more approved HITs; ii) automatically reject answers from annotators who select the invalid option 'None of the above'. We report the inter-annotator agreement (Cohen's Kappa, Fleiss' Kappa and Krippendorff's Alpha) of the collected annotations on Color and Shape in Table 6. Note that Cohen's Kappa compares only two annotators; we therefore report the mean of the Cohen's Kappa scores over all K(K−1)/2 distinct pairs of annotators. The results show that our collected human annotation datasets cover cases of both strongly (Color) and weakly (Shape) correlated human annotations.

