ELICITATION INFERENCE OPTIMIZATION FOR MULTI-PRINCIPAL-AGENT ALIGNMENT

Abstract

In multi-principal-agent alignment scenarios including governance, markets, conflict resolution, and AI decision-making, it is infeasible to elicit every principal's view on all perspectives relevant to an agent's decisions. Elicitation inference optimization (EIO) aims to minimize the n elicitations needed to approximate N principals' views across K perspectives. In this work, we demonstrate an EIO approach where data efficiency (NK/n) increases with scale. We introduce STUMP: an elicitation inference model which integrates a large language model with a latent factor model to enable learning transfer across samples, contexts, and languages. We characterize STUMP's performance on a set of elicitation primitives from which scalable elicitation (sampling) protocols can be constructed. Building from these results, we design and demonstrate two elicitation protocols for STUMP where, surprisingly, data efficiency scales like O(n) in the number of elicitations n. In other words, the number of elicitations needed per principal remains constant even as the number of perspectives and principals grows. This makes it possible to approximate complex, high-dimensional preference signals spanning principal populations at scale.

1. INTRODUCTION

The principal-agent problem involves aligning agent decisions with principal interests. A challenge is creating situations where agent choices are sufficiently influenced by signals containing principal preferences. With a single principal, high-complexity preference signals can be elicited directly via open-ended interaction. Multi-principal-agent scenarios can involve large populations of principals and powerful agents such as: governments & citizens Giger and Lefkofridi (2014); Gabriel (2020), firms & customers Roberts and Grover (2012), peacekeepers & conflict parties United Nations (2012), existing AI systems & impacted populations Prabhakaran et al. (2022), and potentially even transformative AI & humanity Russell et al. (2015); Christiano et al. (2017). As the number of principals grows, and the domain of agent decisions becomes open-ended, directly eliciting the preferences of all principals on all relevant perspectives becomes infeasible [A.1]. As a result, lower-complexity forms of elicitation like ballot voting (e.g., for governments) and price signals (e.g., for firms) are used to learn preferences. While clearly effective (a basis of democracy and the economy), these approaches drastically simplify real preferences. For example, they do not allow citizens (principals) to express what they would like a government (agent) to do, or why, only whether they support predefined options. Elicitation inference optimization (EIO) aims to decrease the amount of direct elicitation needed to recover a preference signal (enabling the use of more complex, higher-dimensional preferences; e.g., in natural language). Consider an N × K matrix Θ where rows correspond to N principals, columns correspond to K perspectives, and every element captures a principal-perspective relationship. The goal of elicitation inference optimization is to obtain a sufficient approximation of Θ with a minimal elicitation budget by directly sampling some elements and inferring the rest.
Thus, EIO involves combining a) a sparse elicitation (sampling) protocol with, b) an elicitation inference model. Closed-ended surveys simplify EIO by constraining the set of relevant perspectives to a predefined set, typically with K << N. Matrix sampling techniques elicit responses from each participant on a subset of perspectives selected randomly Shoemaker (1973), heuristically Raghunathan and Grizzle (1995), or dynamically such that inference accuracy is adaptively optimized Gonzalez and Eltinge (2008); Zhang et al. (2020). Inference models exploit the low rank of Θ via matrix factorization and other collaborative filtering techniques Zhang et al. (2020); Sengupta et al. (2021); Oliveira et al. (2021). Collective response systems (CRS) enable high-complexity signals by allowing the set of relevant perspectives to be open-ended: neither limited in number nor requiring pre-definition Ovadya (2022). In a CRS, participants contribute open-ended perspectives in the context of a question, prompt, or conversation and respond to subsets of perspectives contributed by others, typically yielding K ∼ O(N). Inspired by preference models Thurstone (1927) and conjoint analysis Green and Srinivasan (1978), one class of approaches elicits randomized pair-choice votes Konya and Slodov (2015) and uses hierarchical probit-like models for inference Salganik and Levy (2015). Other approaches elicit agreement votes to learn absolute human-perspective relationships. Polis samples agreement votes semi-randomly, prioritizes votes that improve clustering, and uses mean imputation to support dimensionality reduction Small et al. (2021). Additional work bridges these approaches by randomly sampling both pair-choice and agreement votes, inferring missing votes by modeling both vote types with a single utility matrix learned via regularized matrix completion Bilich et al. (2019).
Previous approaches to EIO for CRS have been limited in scope to single samples where the set of relevant perspectives all correspond to the same context (i.e., the same question or prompt) Salganik and Levy (2015); Konya and Slodov (2015); Small et al. (2021); Bilich et al. (2019). This means votes must be elicited from every person for every new context. As a result, to approximate some Θ spanning an arbitrary number of contexts, the number of direct elicitations needed grows in proportion to the number of elements in Θ [A.2]; data efficiency does not increase with scale. However, previous approaches do not fully leverage all available data: learning from one context is not transferred to support inference in other contexts, and the information contained in the perspective text is not used at all. Can an EIO approach that better leverages available data enable increasing data efficiency with scale? In this work we introduce an approach to EIO for CRS which becomes increasingly data efficient as the amount of data elicited grows. First, we introduce a novel elicitation inference model, STUMP, which better leverages available data by integrating a pre-trained LLM with a latent factor model. Second, we characterize STUMP's performance across a range of elicitation primitives from which arbitrarily scalable elicitation (vote sampling) protocols can be constructed. Finally, building on these results, we design and demonstrate two scalable elicitation protocols where STUMP can infer with meaningful accuracy while data efficiency increases linearly with scale.

2. PROBLEM SETUP

Elicitation form. In a collective response system (CRS) participants respond to prompts with open-ended perspectives and vote on perspectives submitted by others. Let H be the set of N human participants, P be the set of K perspectives, and Θ be the corresponding N × K human-perspective matrix. Both agreement and pair-choice votes may be elicited on perspectives. In agreement exercise e^a_ij, participant i is asked if they agree with perspective j. In pair-choice exercise e^c_ijk, participant i is asked if they prefer perspective j or k. We denote a set of exercise data E = {e^a_ij, e^c_ijk | i ∈ H; j, k ∈ P}. Elicitation inference involves training a model on available data and predicting all missing elements in a target Θ. We choose binary agreement as the human-perspective relationship for our target Θ. Inference performance is probed experimentally by training on a subset of exercises including both vote types, E_t, and computing prediction accuracy on a validation set of only agreement votes, E_v. Data efficiency in EIO can be quantified as the ratio of a) the number of votes contained in the target Θ and b) the number of votes directly elicited via a sampling protocol which a model needs to be trained on to meaningfully approximate Θ. We denote the data efficiency to achieve inference accuracy acc as:

β_acc = n(Θ) / n(E_t) = (n(H) × n(P)) / n(E_t) = NK / n_t

where n(X) denotes the size of set X and n_t ≡ n(E_t). We refer to the rate at which data efficiency changes with more elicited votes (∂β_acc/∂n_t) as data leverage. EIO becomes more data efficient with scale if data leverage is positive for arbitrarily large n_t above a sufficient threshold of accuracy. To achieve this, the challenge is to combine: a) a scalable elicitation (sampling) protocol where NK grows faster than n_t, and b) an inference model that sustains meaningful inference accuracy at scale using the data the protocol generates.
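The data-efficiency ratio above can be made concrete with a small sketch (the function name and all numbers here are ours, purely illustrative, and not from the paper's codebase):

```python
def data_efficiency(n_participants: int, n_perspectives: int, n_elicited: int) -> float:
    """beta = n(Theta) / n(E_t) = (N * K) / n_t."""
    return (n_participants * n_perspectives) / n_elicited

# Hypothetical budget: 500 participants, 3,000 perspectives, 50,000 elicited votes.
beta = data_efficiency(500, 3000, 50_000)
print(beta)  # 30.0: each elicited vote stands in for 30 entries of Theta
```

Under these hypothetical numbers, each directly elicited vote supports inference of 30 entries of Θ; EIO aims to make this ratio grow with scale.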

3. ELICITATION INFERENCE MODEL

Elicitation inference (EI) is analogous to collaborative filtering (CF) in recommenders, except instead of items people may prefer, there are perspectives people may agree with. A standard approach to CF is to learn a low-rank approximation of a user-item preference matrix by optimizing an objective function against observed preferences (Johnson, 2014). Bilich et al. (2019) demonstrated a similar approach for EI on CRS data. They use regularized matrix completion with a low-rank-promoting nuclear norm prior similar to Fazel et al. (2001) to learn a utility matrix that predicts votes. We refer to this as the nuclear norm model (NN), and adopt it as a baseline (see A.3 for technical details). Latent factor models (LFMs) learn vectors for each user z_i and item w_j which embed them in the same latent space, and describe the preference of user i for item j as a function of u_ij = z_i • w_j (Johnson, 2014). Known item features (e.g., the director of a movie) can be used to improve performance by projecting features into latent space as a component of item embeddings (Zhang et al., 2021). Relevant features can also be extracted from item content (e.g., a song's time sequence) by linking content to latent space embeddings (van den Oord et al., 2013). We consider EI on text-based perspectives. Collaborative Topic Ranking (CTR) uses information in text by learning a topic model on item texts and equating topic feature vectors to latent space embeddings (Xin et al., 2015). Pre-trainable large language models (LLMs) (Cer et al., 2018; Devlin et al., 2018; Yang et al., 2019) learn to encode information from text into embeddings suitable for a wide range of downstream tasks (Hao et al., 2020; Schick and Schütze, 2021). Our approach to using the information in perspective texts is to start with a pre-trained LLM, then, similar to Zhang et al. (2021), learn a transformation of text embeddings into latent space.
A key challenge in CF is maintaining alignment between latent spaces learned from non-overlapping data (Cremonesi and Quadrana, 2014). Sharing the same embedding model and transformation across perspectives enforces alignment even in the absence of overlapping data (see Section 4.3). Cross-domain recommendation (CDR) involves aligning latent spaces across items in different domains like movies and songs (Zang et al., 2022). One class of approaches uses information like tags that can be shared across domains (Sahu et al., 2018). Others focus on aligning latent space distributional properties (Liu et al., 2021). In EI the analog to CDR is multi-context inference. How one votes on a perspective depends on the context it was elicited in. Take the perspective "I feel happy." Consider two scenarios, where it was elicited by asking "What would you say if you saw your parents smile?" or "What wouldn't you say if you saw your parents smile?". One would expect nearly opposite voting behavior. But, if the same embedding model and transformation are shared in both contexts, then predictions will be the same. Enabling multi-context inference requires relaxing this condition. In the spirit of Sahu et al. (2018), our approach is to share the same LLM across contexts, but learn a unique transformation into latent space for each context. Our model. We formally introduce our model, the Semantic Transfer Utility Model of Participants: STUMP. We denote the event that participant i reports agreement with perspective j by a_ij, and disagreement by d_ij. We denote the event that participant i reports preference for perspective j over perspective k by c_ijk. We assume a logit utility model of choice similar to Johnson (2014); Bilich et al.
(2019), where participant i has a latent utility u_ij corresponding to perspective j and voting probabilities are given by

p(a_ij | u_ij) = σ(u_ij), p(d_ij | u_ij) = 1 - σ(u_ij), and p(c_ijk | u_ij, u_ik) = σ(u_ij - u_ik),

where σ denotes the logistic function. The utility u_ij for participant i and perspective j is constructed as an LFM: u_ij = z_i • w_j, where z_i is a latent embedding for participant i and w_j is a latent embedding for perspective j. To use information in perspective text, we employ a pre-trained LLM to generate text embeddings. We denote the embedding of the j-th perspective text p_j by model F with parameters ϕ as e_j = F(p_j | ϕ). Text embeddings are mapped into latent space by context-specific transformations: w_j = tanh(M_q e_j + b_q), where M_q and b_q are the transformation weight and bias for context q (see Figure 1). Finally, let A = {(i, j) | a_ij}, D = {(i, j) | d_ij}, and C = {(i, j, k) | c_ijk}. The log loss is defined by:

L = -( Σ_{(i,j,k)∈C} log σ(u_ij - u_ik) + Σ_{(i,j)∈A} log σ(u_ij) + Σ_{(i,j)∈D} log(1 - σ(u_ij)) ) + λ_m ∥M∥_F

where ∥·∥_F denotes the Frobenius norm of a matrix, and λ_m is a hyper-parameter controlling regularization strength. The participant embedding vectors z and the set of transformation parameters for each context {M_q, b_q} are learned by minimizing the log loss on observed data. Optionally, the LLM parameters ϕ may be fine-tuned through the same minimization.
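The model components above can be sketched in a few lines of numpy. This is our own minimal illustration of the utility construction and log loss, not the paper's implementation; all function names are ours, and the "LLM embedding" e_j is taken as a given vector:

```python
import numpy as np

def sigmoid(x):
    """Logistic function sigma."""
    return 1.0 / (1.0 + np.exp(-x))

def perspective_latent(e_j, M_q, b_q):
    """w_j = tanh(M_q e_j + b_q): context-specific map from text-embedding space."""
    return np.tanh(M_q @ e_j + b_q)

def utility(z_i, w_j):
    """u_ij = z_i . w_j"""
    return float(z_i @ w_j)

def log_loss(U, agree, disagree, choices, M_list, lam=0.01):
    """Negative log-likelihood over agreement, disagreement, and pair-choice
    votes, plus Frobenius-norm regularization on the context transforms.
    U is a (participants x perspectives) utility matrix."""
    loss = 0.0
    for (i, j) in agree:            # p(a_ij) = sigma(u_ij)
        loss -= np.log(sigmoid(U[i, j]))
    for (i, j) in disagree:         # p(d_ij) = 1 - sigma(u_ij)
        loss -= np.log(1.0 - sigmoid(U[i, j]))
    for (i, j, k) in choices:       # p(c_ijk) = sigma(u_ij - u_ik)
        loss -= np.log(sigmoid(U[i, j] - U[i, k]))
    return loss + lam * sum(np.linalg.norm(M) for M in M_list)
```

In practice the loss would be minimized with gradient descent over the participant embeddings z_i and the per-context parameters {M_q, b_q}; the sketch only shows the forward computation.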

4. ELICITATION PRIMITIVES

Our strategy for elicitation protocol design begins with characterizing STUMP's performance across a range of elicitation primitives from which arbitrarily scalable protocols can be constructed. To this end, we explore four types of elicitation primitives arising from two orthogonal data set properties: context homogeneity and sample overlap. Context homogeneity. Perspectives and votes are elicited under some context, for example, in response to a question. Thus, a data set (or Θ) may correspond to one or more contexts. We denote the set of perspectives corresponding to the q-th context as P_q ⊆ P.

Sample overlap.

Vote data E may come from one or more samples, where each s-th sample involves a subset of humans H_s ⊆ H and a subset of perspectives P_s ⊆ P. Two samples are said to have no overlap when they share neither humans nor perspectives, i.e., H_1 ∩ H_2 = ∅ and P_1 ∩ P_2 = ∅. A data set has sample overlap when it contains only a single sample, or multiple samples that share either humans or perspectives, i.e., H_1 ∩ H_2 ≠ ∅ or P_1 ∩ P_2 ≠ ∅. Permuting these two properties gives four types of elicitation primitives (Figure 2, Table 1):

Type  Context    Sample overlap  Condition
I     Single     Overlap         P_q = P and (H_1 ∩ H_2 ≠ ∅ or P_1 ∩ P_2 ≠ ∅)
II    Single     No overlap      P_q = P and H_1 ∩ H_2 = P_1 ∩ P_2 = ∅
III   Multiple   Overlap         P_q ≠ P and (H_1 ∩ H_2 ≠ ∅ or P_1 ∩ P_2 ≠ ∅)
IV    Multiple   No overlap      P_q ≠ P and H_1 ∩ H_2 = P_1 ∩ P_2 = ∅
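The four-way taxonomy can be expressed as a tiny classification rule. This helper is our own illustration (the function name and argument conventions are assumptions, not from the paper):

```python
def primitive_type(single_context: bool, h1, h2, p1, p2) -> str:
    """Classify which elicitation primitive (I-IV) two samples form.

    single_context: whether all perspectives share one context (P_q = P).
    h1, h2: participant sets of the two samples; p1, p2: perspective sets.
    """
    overlap = bool(set(h1) & set(h2)) or bool(set(p1) & set(p2))
    if single_context:
        return "I" if overlap else "II"
    return "III" if overlap else "IV"
```

For example, two samples in the same question context sharing a participant form a Type I primitive, while samples from different questions with disjoint participants and perspectives form Type IV.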

4.1. ELICITATION PRIMITIVE EXPERIMENTS

Baselines. We use four prediction baselines to contextualize STUMP's performance: a) a Random baseline where votes are predicted randomly with uniform probability, b) a Most Common (MC) baseline where votes are predicted to be the most common choice observed for each perspective, c) the nuclear norm (NN) model from Bilich et al. (2019) [A.3], which represents a standard approach to CF and is the only prior work we are aware of focused on single-vote-level inference for CRS, and d) the classic matrix factorization (CF-MF) model widely used in recommendation systems (Koren et al., 2009). Data and implementation. Data used for experiments was collected as part of regular load testing for an existing CRS product and de-identified prior to use in this study. Data came from 534 participants solicited through online research panels. We use data corresponding to seven questions [A.4] asked during the CRS process, chosen for their nuance and presumed hard-to-predict nature. Each question is treated as a unique context, meaning only perspectives and votes elicited in response to the same question share the same context transformation weights. The full dataset consists of 172,242 votes across 3,703 perspectives made by 534 humans. Experiments are constructed by partitioning exercise data E into training and validation sets (E_t and E_v). In each of the primitives described below, the inference model is trained with E_t and evaluated with E_v. Data and training details are described in [A.5].

4.2. TYPE I: SINGLE CONTEXT, OVERLAPPING SAMPLE

Data elicited from one question during a collective response (CR) represents a type I primitive. We test type I model performance as a function of votes per participant (VPP) for two modes of CR: asynchronous and synchronous. In an asynchronous CR, participation happens on a rolling basis, so participants only vote on perspectives submitted prior to their participation. In a synchronous CR, all participation happens in the same minute-scale window of time, so any perspective may be voted on by any participant. We simulate both cases while varying VPP by partitioning the appropriate votes into training and validation sets. To simulate the asynchronous CRS scenario, participants' votes pertaining to perspectives submitted after their own become the validation set E_v, and votes pertaining to perspectives submitted before their own become the training pool. We modulate vote density by adjusting the fraction of votes from the training pool which is randomly partitioned into the training set E_t. To simulate the synchronous CRS scenario, we randomly partition votes such that the training set E_t and validation set E_v are the same size as those in the asynchronous scenarios for each experiment. For each experiment, we train the models for each question and compute the corresponding validation accuracy. STUMP outperforms all baselines at < 30 VPP and achieves a level of accuracy at only ∼10 VPP which the NN baseline reaches only with 3x as much data (Figure 3). Type I as a baseline. As the simplest elicitation primitive, Type I performance across simple baselines informs what constitutes meaningful accuracy in general. The Random baseline achieves 50% accuracy (not shown) and the Most Common baseline saturates at an average of ∼64% accuracy with a high degree of variance. A reasonable conclusion is that 50-64% accuracy is not meaningful, 65-69% is arguably meaningful, and 70% or more is meaningful.
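The asynchronous train/validation split described above can be sketched as follows. This is our own reconstruction of the partitioning logic (names and data shapes are assumptions, not the paper's code):

```python
import random

def async_split(votes, submit_time, train_fraction, seed=0):
    """Split (participant, perspective) votes for the asynchronous-CR simulation.

    votes: list of (participant_id, perspective_id) pairs.
    submit_time: maps both participant and perspective ids to a timestamp
                 (a participant's timestamp is when they joined / submitted).
    A vote on a perspective submitted AFTER the voter's own participation
    goes to validation; earlier ones form the training pool, which is then
    randomly subsampled to modulate vote density (votes per participant).
    """
    rng = random.Random(seed)
    train_pool, val = [], []
    for (i, j) in votes:
        (val if submit_time[j] > submit_time[i] else train_pool).append((i, j))
    rng.shuffle(train_pool)
    n = int(train_fraction * len(train_pool))
    return train_pool[:n], val
```

Lowering `train_fraction` reduces the effective VPP while keeping the validation set fixed, which is how performance-vs-VPP curves like Figure 3 could be produced.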

4.3. TYPE II: SINGLE CONTEXT, NON-OVERLAPPING SAMPLE

Type I to Type II transition. We simulate EI within a single-question context as the overlap between two samples decreases to zero. For each question, we randomly split the participants and perspectives into two equal-sized samples: {H_s1, P_s1} and {H_s2, P_s2}. Agreement votes by participants in H_s1 on perspectives in P_s2 become the validation set E_v = {e^a_ij, e^c_ijk | i ∈ H_s1; j, k ∈ P_s2}. The remaining votes contained in the question become the Type I training set E_t, where sample overlap is maximal: E_t = {e^a_ij, e^c_ijk | i ∈ H; j, k ∈ P_s1} ∪ {e^a_ij, e^c_ijk | i ∈ H_s2; j, k ∈ P_s2}. Sample overlap is decreased by removing votes on perspectives in P_s1 by participants in H_s2 until none remain, giving the Type II training set E_t = {e^a_ij, e^c_ijk | i ∈ H_s1; j, k ∈ P_s1} ∪ {e^a_ij, e^c_ijk | i ∈ H_s2; j, k ∈ P_s2} (further experimental details are described in [A.6.1]). We define sample overlap p as the fraction of participants in H_s2 with training votes in P_s1, and compute validation accuracy for STUMP and the NN and CF-MF baselines as p varies (Figure 4). The NN and CF-MF baselines are sensitive to sample overlap. In contrast, STUMP's performance remains stable even at p = 0, evidence of STUMP's ability to maintain latent-space alignment across non-overlapping samples. Type II with different languages. We investigate STUMP's inference capability across two non-overlapping samples in different languages. Perspectives in one sample are translated into one of nine languages using the Google Translate API: German (de), Spanish (es), French (fr), Italian (it), Japanese (ja), Dutch (nl), Portuguese (pt), simplified Chinese (zh), and traditional Chinese (zh-tw) [A.6.1]. We compute STUMP's Type II accuracy for each language pair and compare it to an English-English baseline (Figure 5). Performance across all language pairs is on par with the baseline.

4.4. TYPE III: MULTIPLE CONTEXT, OVERLAPPING SAMPLE

Transfer learning. To test STUMP's capability for transfer learning, we pre-train participant embeddings in one context and use them for inference in a new context. Given a pair of contexts Q_1 and Q_2, we denote all perspectives in Q_1 as P_1 and all perspectives in Q_2 as P_2. We split the participants who participated in both contexts randomly into two groups H_s1 and H_s2. The pre-training set contains votes by all participants on perspectives in context Q_1: E_t1 = {e^a_ij, e^c_ijk | i ∈ H; j, k ∈ P_1} ([A.6.2]). We vary the participant fraction r from 0.05 to 0.9, and compute validation accuracy for STUMP and the CF-MF and NN baselines (Figure 6). STUMP performance increases with participant fraction when it is small, leveling off at r ∼ 0.2, while the NN baseline only reaches similar performance around r ∼ 0.8. The CF-MF baseline saturates at lower accuracy. In other words, STUMP can leverage data from a different context via transfer learning to achieve the same performance as the NN baseline with 75% less data. Increased pre-training data. Next, we investigate how more pre-training data from more contexts affects STUMP's transfer learning performance. We increase the number of question contexts whose votes are included in the pre-training set [A.6.2] and compute type III validation accuracy at different participant fractions (Figure 7). STUMP's transfer learning performance increases as the number of question contexts included in pre-training increases.

4.5. TYPE IV: MULTIPLE CONTEXT, NON-OVERLAPPING SAMPLE

Type III to Type IV transition. Finally, we explore transfer learning across contexts as sample overlap decreases. Let sample overlap p be the fraction of participants in the training sample (context Q_2) who are also in the pre-training sample (context Q_1). Moving from Type III to Type IV means decreasing p from 1 to 0. To simulate this, we start with a Type III setup and iteratively remove participants from the pre-training sample who are also in the training sample (more experimental details are described in [A.6.3]). We use the training procedure and validation set from Section 4.4 and compute accuracy as p varies (Figure 8). STUMP's performance is arguably meaningful when sample overlap exceeds 50%, but degrades below that threshold.

5. INCREASING DATA EFFICIENCY WITH SCALE

Based on STUMP's performance characteristics across elicitation primitives, we design two arbitrarily scalable elicitation protocols where: a) one may expect STUMP to sustain meaningful inference accuracy, and b) data leverage is positive, i.e., ∂β/∂n_t > 0. Case I. STUMP shows meaningful type III inference for low participant fractions (Figure 6). This means that if participants all vote on perspectives from a shared context, only a small fraction need to vote on perspectives from a new context to meaningfully infer the rest. Thus, the following elicitation protocol is viable: Elicit an initial sample where m participants each give 1 new perspective and vote on c other perspectives for a single context. Let each sample after the first be elicited from m new participants where each a) provides 1 new perspective and votes on c other perspectives for a new context, and b) votes on c perspectives for the context shared by all samples (Figure 9). This protocol results in data efficiency of β = n_t/(4c²) + m/(2c) + m/(4n_t) ∼ O(n_t), and data leverage ∂β/∂n_t ≈ 1/(4c²) > 0 [A.7.1]. Case II. STUMP shows arguably meaningful type III/IV inference once sample overlap approaches ∼0.5 (Figure 8). In other words, inference for a population of participants across two contexts requires around half of them to vote on perspectives from both contexts. This implies such an overlap supports latent space alignment between contexts. To maintain latent space alignment across an arbitrary number of contexts, we chain successive overlaps together with the following elicitation protocol: Elicit an initial sample where m participants each provide 1 new perspective and vote on c other perspectives for a single context. Let each sample after the first be elicited from m participants where m/2 participants are new, and m/2 are from the previous sample. Let each participant provide 1 new perspective and vote on c other perspectives for a new context (Figure 10).
This protocol results in data efficiency of β = n_t/(2c²) + m/(2c) ∼ O(n_t), and ∂β/∂n_t = 1/(2c²) > 0 [A.7.2]. We simulate EI under both protocols for q = 2, 3, 4 samples by partitioning the appropriate perspectives, participants, and votes into training and validation sets [A.8]. In both cases STUMP sustains a meaningful accuracy of ∼70% as q increases (Figure 11), approximating O(n_t²) votes using only O(n_t) elicitations. Thus, the combination of either elicitation protocol with STUMP achieves β_70 ∼ O(n_t) (Figure 12), giving positive data leverage: ∂β_70/∂n_t = const > 0. In other words, data efficiency increases linearly with scale.
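The two protocols' data-efficiency formulas can be checked numerically. This sketch is ours (the formulas are from the text; the specific values of m, c, and n_t are arbitrary illustrations):

```python
def beta_case1(n_t, m, c):
    """Case I data efficiency: beta = n_t/(4c^2) + m/(2c) + m/(4 n_t)."""
    return n_t / (4 * c**2) + m / (2 * c) + m / (4 * n_t)

def beta_case2(n_t, m, c):
    """Case II data efficiency: beta = n_t/(2c^2) + m/(2c)."""
    return n_t / (2 * c**2) + m / (2 * c)

# Finite-difference data leverage d(beta)/d(n_t) at a large n_t approaches
# the stated constants 1/(4c^2) and 1/(2c^2).
m, c = 100, 10
d1 = beta_case1(20_001, m, c) - beta_case1(20_000, m, c)  # ~ 1/(4 c^2)
d2 = beta_case2(20_001, m, c) - beta_case2(20_000, m, c)  # = 1/(2 c^2)
```

Both leverages are positive and constant in n_t, which is what makes β grow linearly with the number of elicitations.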

6. DISCUSSION

Improving inference accuracy. This work prioritized data efficiency with scale over maximizing inference accuracy. We do not expect STUMP to achieve SOTA accuracy in its current form. Neural collaborative filtering (NCF), which replaces the inner product between latent space embeddings with a neural architecture able to learn a more general function, has been shown to improve performance in recommenders He et al. (2017), and is likely to improve accuracy in this setting as well (preliminary experiments have confirmed this). Experiments in Section 4.4 indicate more, and more diverse, pre-training data can improve accuracy. STUMP's use of information in the perspective text is limited by the LLM it employs and we expect a SOTA LLM Chowdhery et al. (2022); Zhang et al. (2022); Mitchell et al. (2022) can improve accuracy. Other directions to improve accuracy may include: using participant side information like demographics, extending STUMP to use information in prompts, using active sampling to elicit votes expected to give the most information Gonzalez and Eltinge (2008); Zhang et al. (2020), and fine-tuning LLM weights while training STUMP. These will be our future work directions. Limitations. STUMP fails to align latent spaces across non-overlapping samples from different contexts. This may require an embedding distribution alignment module and Stein path training Liu et al. (2021). STUMP assumes a low-rank structure and does not model non-linearity. Our work does not quantify confidence in aggregate vote statistics, though Bilich et al. (2019) shows stochastic weight averaging Maddox et al. (2019) can estimate the posterior variance in this setting. We only use text perspectives. Handling perspectives spanning text, images, audio, and video requires swapping STUMP's LLM for a multi-modal foundation model able to generate aligned embeddings across media types Jabbar et al. (2022); Radford et al. (2021); Bommasani et al. (2021); Song et al.
(2022); Reed et al. (2022). Our experiments were limited by the data available, and it is possible that some kinds of contexts, perspectives, or participants are materially different in ways that make our results generalize less than expected. STUMP can only represent changing participant views by treating different times as different contexts, while recommenders often model this explicitly Chen et al. (2021a). This work does not show how to make results understandable for human agents, though clustering in latent space is a promising direction Small et al. (2021). We don't show how learned preference signals can directly control other ML models, but jury learning for classifiers Gordon et al. (2022) and reward learning for RL agents Ziegler et al. (2019) are encouraging directions. STUMP is not differentially private Dwork (2008), but approaches developed for LFMs in recommenders may be applicable Berlioz et al. (2015). Finally, elicitation inference is just one of the components helpful for achieving the underlying goals of multi-principal-agent alignment. Risks. In CRS, some biases common to recommenders Chen et al. (2020) are less significant due to the structure of the task. In particular, if vote exercises are random and can't be skipped, while users are anonymous and can't see results, then biases spanning selection, exposure, conformity, position, and popularity are minimized. However, participants who respond and vote more slowly may still be under-represented in the data, bad elicitation prompts can still bias participants Litwak (1956), and malicious actors may still attempt to influence results through coordinated participation. Biases in pre-trained LLMs Schramowski et al. (2022); Nadeem et al. (2021); Liu et al. (2022) may also propagate through the model to affect inference in unknown ways. We avoid active sampling Zhang et al. (2020) due to the risk of compounding small model bias into large inference bias.
Further, non-representative participant populations can arise from biased solicitation, or unequal willingness or ability to participate. As such systems become used for important decision-making, they risk neglecting the views of already marginalized communities due to limited technology access and capacity Roessler (2018) unless specifically mitigated for, e.g., by using existing inclusive voting infrastructure, structurally increasing digital equity, and/or sortition with high-touch support. Even if the data collected is unbiased and representative, there are still risks that unexpected inference errors can corrupt results, or that even high-quality results may be misinterpreted or misused due to miscalibrated confidence, missing context, or malice. Elicitation inference more generally enables a different set of tradeoffs than traditional approaches to vote collection and aggregation. Its efficiency gains can enable increased access to agenda-setting power compared to a traditional vote, with the tradeoff that this involves accepting approximation and error bars in vote outcomes.

Conclusion.

We have demonstrated EIO for collective response systems where data efficiency increases linearly with the size of the elicited data set. As a result, the number of elicitations needed per participant remains constant even as the number of perspectives and participants grows. Our experiments suggest this scaling relationship is likely to hold for data sets spanning contexts, languages, and population samples. If true, that would make it feasible to approximate high-complexity preference signals for massive principal populations spanning open-ended perspectives across an agent's decision domain. Such signals may be used to steer complex objective and evaluation functions for multi-principal-agent scenarios spanning governance, markets, diplomacy, and AI system alignment. For example, to support collective response for peacebuilding Alavi et al. (2022) and constructive democratic governance of global platforms Ovadya (in combination with other processes and tools). More speculatively, EIO may help ensure that the quality and complexity of human input driving the actions of powerful automated systems increases with the capabilities of those systems [A.9].

A APPENDIX

A.1 ELICITATION FEASIBILITY

Assume a scenario with 10,000 principals and an agent whose decision space is open-ended enough that each principal may provide a unique perspective related to decisions the agent could make in a given month. Consider 10,000 perspectives, and assume that directly eliciting a principal's view on a perspective involves casting a single vote taking 10 seconds. Direct elicitation of all votes then equates to 28 hours spent voting per principal per month (280,000 hours per month in total). This is unreasonable but theoretically possible. However, increasing this to just 100,000 principals equates to 280 hours per principal per month (more than a full-time job), and increasing it to 1,000,000 principals equates to 2,800 hours per principal per month, which is more than the number of hours in a month. For the US government as agent, there are more than 300,000,000 principals (US citizens).
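The arithmetic above can be checked in a few lines. This is a sketch assuming, as in the text, one 10-second vote per principal-perspective pair, with perspectives scaling 1:1 with principals:

```python
# Back-of-envelope check of the A.1 feasibility numbers: one 10-second vote
# per principal-perspective pair, perspectives scaling 1:1 with principals.

SECONDS_PER_VOTE = 10
HOURS_IN_MONTH = 730  # ~30.4 days * 24 hours

def monthly_voting_hours(n_perspectives: int) -> float:
    """Hours each principal spends voting on every perspective per month."""
    return n_perspectives * SECONDS_PER_VOTE / 3600

assert round(monthly_voting_hours(10_000)) == 28          # feasible, barely
assert round(monthly_voting_hours(100_000)) == 278        # > a full-time job
assert monthly_voting_hours(1_000_000) > HOURS_IN_MONTH   # impossible
```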

A.2 SCALING BEHAVIOR OF EIO FOR CRS BASED ON PREVIOUS APPROACHES

Consider a human-perspective matrix Θ corresponding to N humans and K perspectives spanning Q different contexts (i.e. elicited by Q different questions or prompts). Assume the corresponding CRS generates a finite number of perspectives k for each context, so K = Qk. The number of elements in Θ is then NQk ∼ O(NQ). Bilich et al. (2019) demonstrate an EIO approach to CRS where only a finite number of votes per person, c, is needed to recover an approximation of Θ for a single context. This means the number of votes which need to be sampled for a Θ spanning Q contexts is NQc ∼ O(NQ).
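A small sketch of this bookkeeping, with illustrative values for N, Q, k, and c, shows why a fixed per-context vote budget keeps data efficiency constant rather than growing with scale:

```python
# Scaling argument from A.2: with a fixed per-person vote budget c per
# context, data efficiency N*K/n equals k/c, independent of N and Q.

def data_efficiency(N: int, Q: int, k: int, c: int) -> float:
    """Elements of Theta recovered per elicited vote."""
    elements = N * Q * k   # Theta has N*Q*k entries (K = Q*k)
    votes = N * Q * c      # c votes per person per context
    return elements / votes

# Same efficiency at small and large scale: k/c = 50/10 = 5.
assert data_efficiency(100, 5, 50, 10) == data_efficiency(10_000, 50, 50, 10) == 5.0
```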

A.3 NUCLEAR NORM MODEL DETAILS

The nuclear norm model likelihood is given by

p(A, D, C | u) = ∏_{i,j ∈ A} σ(u_ij) · ∏_{i,j ∈ D} (1 − σ(u_ij)) · ∏_{i,j,k ∈ C} σ(u_ij − u_ik)

…normative alignment signals improves as the agent learns. As an example, this could happen in a scenario where the following are true: 1. The quality of perspective representations (e.g. embeddings) produced by the shared FM, and used by the EIO model, improves as the RL agent learns. 2. The performance (e.g. inference accuracy or data efficiency) of EIO increases as the quality of perspective representations improves.

A.10 PARTICIPANT PAYMENT AND CONSENT

The 534 participants who joined the load testing exercise were solicited through online panel providers for an incentive of approximately $15/hour USD, and consented to the policy shown in Figure 14. While this consent form does mention PII, it was a general consent form for the data provider, and no PII was known to be contained in the data set made available for this research. The researchers were structurally limited from accessing any directly collected PII due to system design, and none of the raw outputs inspected directly during the research included any PII.



Figure 1: Diagram of STUMP. Universal model elements shown in grey. Perspectives and contextspecific model elements colored by context.

Figure 2: Elicitation primitives. Color boxes indicate sparse vote data available for training. Colors denote context.

Figure 3: Type I inference accuracy vs votes per participant.

Figure 6: Type III inference accuracy vs participant fraction r.

Figure 9: Case 1 sampling protocol, where shaded regions correspond to sampled votes and color corresponds to context.

Figure 10: Case 2 sampling protocol, where shaded regions correspond to sampled votes and color corresponds to context.

and is used to pre-train participant embeddings. The second training set contains votes for a fraction r of participants on perspectives in a new context Q_2, E_t2 = {e^a_ij, e^c_ijk | i ∈ H_s2, j, k ∈ P_2}, and r = n(H_s2)/(n(H_s1) + n(H_s2)). The pre-trained embeddings are frozen while the transformation to latent space for the new context is learned. Votes on perspectives in Q_2 by participants not in the second training set comprise the Type III validation set E_v = {e^a_ij, e^c_ijk | i ∈ H_s1, j, k ∈ P_2} (more details are described in [A

Figure 11: Inference accuracy as a function of the number of local samples.

Properties of elicitation primitives.


where A = {i, j | a_ij}, D = {i, j | d_ij}, and C = {i, j, k | c_ijk}. The NN model posterior is given by

p(u | A, D, C) ∝ p(A, D, C | u) · 1_{||u||_* < τ}

where 1_{||u||_* < τ} denotes a uniform prior over the nuclear norm ball for u of radius τ. Learning happens through optimization via SGD of the MAP estimator.

A.4 DATA SOURCE

Data used for this research comes from periodic load testing done by a private company for internal quality assurance, and was fully de-identified prior to use in the research. The researchers identified that portions of it could be interesting for this work due to the scale of the data and the predictive challenges associated with the nuance and ambiguity of certain questions that were asked. Data made available for this research came from the following open-ended questions asked during the collective response load test:
• During the pandemic, what fun experiences have you newly discovered? These can be any activities, hobbies, or events you consider fun.
• How or where did you learn the new routine or behavior?
• Tell me about a product or service you have used MORE than ever in the past year. Why did you use it more?
• Please tell me more - how is this impacting your decisions, if at all?
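The nuclear norm model likelihood (A.3) can be evaluated directly on sparse vote data. A minimal pure-Python sketch, assuming the pair-choice factor compares the same utilities u_ij and u_ik (function and variable names are illustrative):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def nn_log_likelihood(u, agrees, disagrees, choices) -> float:
    """log p(A, D, C | u) over a utility matrix u (list of rows):
    agrees/disagrees are (i, j) pairs; choices are (i, j, k) triples
    where participant i chose perspective j over perspective k."""
    ll = 0.0
    for i, j in agrees:
        ll += math.log(sigmoid(u[i][j]))
    for i, j in disagrees:
        ll += math.log(1.0 - sigmoid(u[i][j]))
    for i, j, k in choices:
        ll += math.log(sigmoid(u[i][j] - u[i][k]))
    return ll

u = [[0.0] * 3 for _ in range(2)]  # indifference: every factor is sigma(0) = 0.5
ll = nn_log_likelihood(u, [(0, 0)], [(1, 2)], [(0, 1, 2)])
assert abs(ll - 3 * math.log(0.5)) < 1e-12
```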

A.5 DATA AND TRAINING DETAILS

For each question in the dataset, there are an average of 529 perspectives, 14,455 agreement exercises, and 10,151 pair-choice exercises. The full dataset consists of 172,242 votes across 3,703 perspectives made by 534 humans. In training the STUMP model, we use L2 regularization on the transformation weight M and apply dropout on the universal embedding. We use a pre-trained multilingual universal sentence encoder (Apache-2.0 license) to generate perspective embeddings F(p_i | ϕ). Hyper-parameters are tuned by grid search, and the following values are used in our experiments: regularization rate 0.01, dropout rate 0.3, latent participant embedding dimension 50, and learning rate 0.001 with the Adam optimizer. The models are implemented in PyTorch and trained on a PC with a 4-core Intel i7-1165G7 processor.
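As an illustration of the model shape these hyper-parameters apply to, the numpy sketch below scores participant-perspective agreement with a bilinear form σ(z_i · (Mᵀ f_j)). The exact way STUMP combines z, M, and F(p) is an assumption here, and the dimensions are illustrative (50-dim latent participant embeddings as above; 512 stands in for the sentence-encoder width):

```python
import numpy as np

rng = np.random.default_rng(0)

n_participants, n_perspectives = 4, 6
d_latent, d_sentence = 50, 512

z = rng.normal(size=(n_participants, d_latent))      # latent participant embeddings
M = 0.01 * rng.normal(size=(d_sentence, d_latent))   # context transformation weight
F = rng.normal(size=(n_perspectives, d_sentence))    # sentence-encoder outputs F(p)

def p_agree(i: int, j: int) -> float:
    """Predicted probability that participant i agrees with perspective j:
    a bilinear score z_i . (M^T f_j) squashed through a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(z[i] @ (F[j] @ M))))

probs = [p_agree(i, j) for i in range(n_participants) for j in range(n_perspectives)]
assert all(0.0 < p < 1.0 for p in probs)
```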

A.6 ELICITATION PRIMITIVE EXPERIMENT DETAILS

The exercises in the training sets consist of a mix of agreement and pair-choice exercises, where the agreement exercises account for 60%-75% of the total.

A.6.1 TYPE II: SINGLE CONTEXT, NON-OVERLAPPING SAMPLE

Type I to Type II transition. For each question, we randomly split the participants into two equal-sized groups H_s1 and H_s2, with corresponding perspectives P_s1 and P_s2. We select votes on perspectives in P_s2 by participants in H_s1 as the validation set. The train set consists of the following three pieces: votes on perspectives in P_s1 by participants in H_s1, votes on perspectives in P_s2 by participants in H_s2, and votes on perspectives in P_s1 submitted by a fraction p of participants in H_s2, where p ∈ [0, 1]. When p = 1, the train data has maximum participant overlap, i.e., it contains the maximum number of participants with votes on perspectives in both P_s1 and P_s2. When p = 0, the train data has zero participant overlap, i.e., it contains no participants with votes on perspectives in both P_s1 and P_s2 (Figure 13a).

Type II with different languages. For each question, we randomly split the participants into two groups H_s1 and H_s2, with corresponding perspectives P_s1 and P_s2. The train set consists of exercises pertaining to perspectives in P_s1 submitted by participants in H_s1, and exercises pertaining to perspectives in P_s2 submitted by participants in H_s2. The validation set consists of exercises pertaining to perspectives in P_s1 submitted by participants in H_s2, and exercises pertaining to perspectives in P_s2 submitted by participants in H_s1. The second group's perspectives are translated into nine languages: German (de), Spanish (es), French (fr), Italian (it), Japanese (ja), Dutch (nl), Portuguese (pt), simplified Chinese (zh), and traditional Chinese (zhtw).
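The Type I to Type II split can be sketched in a few lines of pure Python; the names, the seeded shuffle, and the string placeholders for perspective sets are illustrative:

```python
import random

def type2_split(participants, p, seed=0):
    """Split participants into H_s1/H_s2; a fraction p of H_s2 also votes
    on P_s1, controlling participant overlap between the two blocks."""
    rng = random.Random(seed)
    shuffled = participants[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    H_s1, H_s2 = shuffled[:half], shuffled[half:]
    overlap = H_s2[: int(p * len(H_s2))]   # H_s2 members also voting on P_s1
    train = (
        [(i, "P_s1") for i in H_s1]        # H_s1 votes on its own perspectives
        + [(i, "P_s2") for i in H_s2]      # H_s2 votes on its own perspectives
        + [(i, "P_s1") for i in overlap]   # cross votes controlled by p
    )
    valid = [(i, "P_s2") for i in H_s1]    # held-out cross votes
    return train, valid

train, valid = type2_split(list(range(100)), p=0.5)
assert len(valid) == 50 and len(train) == 125  # 50 + 50 + 25 cross votes
```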

A.6.2 TYPE III: MULTIPLE CONTEXT, OVERLAPPING SAMPLE

Transfer learning. Given a pair of contexts Q_1 and Q_2, we denote all perspectives in Q_1 as P_1 and all perspectives in Q_2 as P_2. We split the participants who participated in both contexts randomly into two groups H_s1 and H_s2. The pre-train set contains exercises pertaining to perspectives in P_1 submitted by all participants, E_t1 = {e^a_ij, e^c_ijk | i ∈ H, j, k ∈ P_1}. The second train set contains exercises pertaining to perspectives in P_2 submitted by participants in H_s2. The validation set contains exercises pertaining to perspectives in P_2 submitted by participants in H_s1 (Figure 13b). The model is pre-trained using the pre-train set E_t1, and the learned participant embeddings z are transferred to the next training cycle using the second train set E_t2. In the second training cycle, z is fixed while the new context transformation M_2 is learned. Then the participant embeddings z learned from pre-training and the context transformation M_2 learned in the second cycle are used together to make predictions for exercises in the validation set.

Increased pre-training data. Given a target context Q_t and its perspectives P_t, we select a set of N − 1 contexts that do not contain Q_t as the pre-train contexts. We split the participants who participated in all of the selected questions randomly into two equal-sized groups H_s1 and H_s2. The pre-train set contains exercises pertaining to perspectives in the pre-train contexts submitted by all participants, E_t1 = ∪_{n=1...N, n≠t} {e^a_ij, e^c_ijk | i ∈ H, j, k ∈ P_n}. The second train set contains exercises pertaining to perspectives in P_t submitted by participants in H_s2. The validation set contains exercises pertaining to perspectives in P_t submitted by participants in H_s1.
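A minimal numeric sketch of the second training cycle, assuming an illustrative bilinear score σ(z_i · (M_2ᵀ f_j)) and agreement labels only: the frozen embeddings z stand in for pre-training output, and only the new context transformation M_2 is updated by SGD.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = rng.normal(size=(8, 5))    # participant embeddings, frozen after pre-training
F2 = rng.normal(size=(10, 7))  # perspective embeddings for the new context Q_2
M2 = np.zeros((7, 5))          # new context transformation (the only learned part)

# Toy agreement votes in Q_2: every participant agrees with every perspective.
votes = [(i, j, 1.0) for i in range(8) for j in range(10)]

lr = 0.05
for _ in range(100):
    for i, j, y in votes:
        pred = sigmoid(z[i] @ (F2[j] @ M2))
        # Gradient ascent on the log-likelihood w.r.t. M2 only; z stays fixed.
        M2 += lr * (y - pred) * np.outer(F2[j], z[i])

mean_pred = float(np.mean([sigmoid(z[i] @ (F2[j] @ M2)) for i, j, _ in votes]))
assert mean_pred > 0.5  # improves over the untrained baseline of exactly 0.5
```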

A.6.3 TYPE IV: MULTIPLE CONTEXT, NON-OVERLAPPING SAMPLE

Type III to Type IV transition. Given a pair of contexts Q_1 and Q_2 and their corresponding perspectives P_1 and P_2, we split the participants who participated in both contexts randomly into two equal-sized groups H_s1 and H_s2. The pre-train set contains exercises pertaining to perspectives in P_1 submitted by participants in H_s1 and exercises pertaining to perspectives in P_1 submitted by a fraction p of participants in H_s2, where p ∈ [0, 1]. The second train set contains exercises pertaining to perspectives in P_2 submitted by participants in H_s2, E_t2 = {e^a_ij, e^c_ijk | i ∈ H_s2, j, k ∈ P_2}. The validation set contains exercises pertaining to perspectives in P_2 submitted by participants in H_s1. When p = 1, E_t1 and E_t2 have maximum participant overlap. When p = 0, they have zero participant overlap (Figure 13c).

A.7.1 CASE I

The first sample elicits mc votes and initializes Θ with N = m participants and K = m perspectives. Each additional sample elicits an additional 2mc votes and adds m new participants and m new perspectives to Θ. After q samples, n_t = mc(2q − 1) votes have been elicited, and Θ comprises K = mq perspectives and N = mq humans, so n(Θ) = m²q². Noting the relation q = (n_t + mc)/(2mc), we have n(Θ) = n_t²/(4c²) + (n_t m)/(2c) + m²/4 ∼ O(n_t²), data leverage is β = n_t/(4c²) + m/(2c) + m²/(4n_t) ∼ O(n_t), and ∂β/∂n_t ≈ 1/(4c²) for n_t >> mc.
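The Case I bookkeeping can be verified numerically: simulate the sampling protocol (first sample mc votes, each later sample 2mc votes, m new participants and m new perspectives per sample) and compare direct counts against the closed form, whose constant term works out to m²/4:

```python
# Numeric check of the Case I derivation with illustrative m and c.
m, c = 20, 5

def case1(q: int):
    """After q samples: total elicited votes n_t, and elements of Theta."""
    n_t = m * c * (2 * q - 1)  # mc for the first sample, 2mc for each after
    N = K = m * q              # participants and perspectives both grow by m
    return n_t, N * K

for q in [1, 2, 10, 100]:
    n_t, n_theta = case1(q)
    # closed form: n(Theta) = n_t^2/(4c^2) + n_t*m/(2c) + m^2/4
    closed = n_t**2 / (4 * c**2) + n_t * m / (2 * c) + m**2 / 4
    assert n_theta == closed
```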

A.7.2 CASE II

The first sample elicits mc votes and initializes Θ with N = m participants and K = m perspectives. Each additional sample elicits an additional mc votes and adds m/2 new participants and m new perspectives to Θ. After q samples, n_t = mcq votes have been elicited, and Θ comprises K = mq perspectives and N = (m/2)(q + 1) humans, so n(Θ) = (m²q² + m²q)/2. Noting the relation q = n_t/(mc), we have n(Θ) = n_t²/(2c²) + (n_t m)/(2c) ∼ O(n_t²), data leverage is β = n_t/(2c²) + m/(2c) ∼ O(n_t), and ∂β/∂n_t = 1/(2c²).
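The analogous numeric check for Case II, with illustrative m and c:

```python
# Numeric check of the Case II derivation: n_t = mcq, K = mq, N = (m/2)(q+1).
m, c = 20, 5

for q in range(1, 50):
    n_t = m * c * q
    n_theta = (m * q) * (m // 2) * (q + 1)  # N * K counted directly
    # closed form: n(Theta) = n_t^2/(2c^2) + n_t*m/(2c)
    closed = n_t**2 / (2 * c**2) + n_t * m / (2 * c)
    assert n_theta == closed
```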

A.8 PROTOCOL SIMULATION EXPERIMENTS

Case I. Given a set of perspectives corresponding to Q different contexts P_1...P_Q, and Q equal-sized non-overlapping sets of participants H_s1...H_sQ, we construct the following experiment for each q = 2...Q: let Θ_q be the HPM corresponding to ∪_{n=1}^{q} {H_sn} and ∪_{n=1}^{q} {P_n}. Partition the training set in each case to correspond to local samples denoted by the shaded areas in Figure 9. Let the validation set E_v contain all other exercises in Θ_q, denoted by the white areas in Figure 9.

Case II. Given a set of perspectives corresponding to Q contexts P_1...P_Q, and Q equal-sized partially overlapping sets of participants H_s1...H_sQ with adjacent overlap p defined by |H_si ∩ H_si+1| = p · |H_si|, we construct the following experiment for each q = 2...Q: let Θ_q be the HPM corresponding to ∪_{n=1}^{q} {H_sn} and ∪_{n=1}^{q} {P_n}. The train set is denoted by the shaded areas in Figure 10, E^q_t = ∪_{n=1}^{q} {e^a_ij, e^c_ijk | i ∈ H_sn, j, k ∈ P_n}, and the validation set E_v contains all other exercises in Θ_q, denoted by the white areas in Figure 10.

