NEURAL LEARNING OF ONE-OF-MANY SOLUTIONS FOR COMBINATORIAL PROBLEMS IN STRUCTURED OUTPUT SPACES

Abstract

Recent research has proposed neural architectures for solving combinatorial problems in structured output spaces. In many such problems, there may exist multiple solutions for a given input, e.g., a partially filled Sudoku puzzle may have many completions satisfying all constraints. Further, we are often interested in finding any one of the possible solutions, without any preference between them. Existing approaches completely ignore this solution multiplicity. In this paper, we argue that being oblivious to the presence of multiple solutions can severely hamper training. Our contribution is twofold. First, we formally define the task of learning one-of-many solutions for combinatorial problems in structured output spaces, which applies to several problems of interest, such as N-Queens and Sudoku. Second, we present a generic learning framework that adapts an existing prediction network for a combinatorial problem to handle solution multiplicity. Our framework uses a selection module whose goal is to dynamically determine, for every input, the solution that is most effective for training the network parameters in any given learning iteration. We propose an RL-based approach to jointly train the selection module with the prediction network. Experiments on three different domains, using two different prediction networks, demonstrate that our framework significantly improves accuracy in our setting, obtaining up to a 21-point gain over the baselines.

1. INTRODUCTION

Neural networks have become the de facto standard for solving perceptual tasks over low-level representations, such as pixels in an image or audio signals. Recent research has also explored their application to symbolic reasoning tasks requiring higher-level inferences, such as neural theorem proving (Rocktäschel et al., 2015; Evans & Grefenstette, 2018; Minervini et al., 2020), and playing blocks world (Dong et al., 2019). The advantage of neural models for these tasks is that they create a unified, end-to-end trainable representation for integrated AI systems that combine perceptual and high-level reasoning. Our paper focuses on one such high-level reasoning task: solving combinatorial problems in structured output spaces, e.g., solving a Sudoku or N-Queens puzzle. These can be thought of as Constraint Satisfaction Problems (CSPs) where the underlying constraints are not explicitly available and need to be learned from training data. We focus on learning such constraints with a non-autoregressive neural model, in which the variables in the structured output space are decoded simultaneously (and therefore independently). Notably, most current state-of-the-art neural models for solving combinatorial problems, e.g., SATNET (Wang et al., 2019), RRN (Palm et al., 2018), and NLM (Dong et al., 2019), use non-autoregressive architectures because of their high efficiency of training and inference: they do not have to decode the solution sequentially. One of the key characteristics of such problems is solution multiplicity: there could be many correct solutions for any given input, even though we may be interested in finding any one of them. For example, in a game of Sudoku with only 16 digits filled, there are always multiple correct solutions (McGuire et al., 2012), and obtaining any one of them suffices for solving the Sudoku.
Unfortunately, existing literature has completely ignored solution multiplicity, resulting in sub-optimally trained networks. Our preliminary analysis of a state-of-the-art neural Sudoku solver (Palm et al., 2018), which trains and tests on instances with single solutions, showed that it achieves a high accuracy of 96% on instances with a single solution, but its accuracy drops to less than 25% when tested on inputs that have multiple solutions. Intuitively, the challenge comes from the fact that (a) there could be a very large number of possible solutions for a given input, and (b) the solutions may be highly varied. For example, a 16-givens Sudoku puzzle could have as many as 10,000 solutions, with the maximum Hamming distance between any two solutions being 61. Hence, we argue that an explicit modeling effort is required to represent this solution multiplicity. As the first contribution of our work, we formally define the novel problem of One-of-Many Learning (1oML). It is given training data of the form {(x_i, Ȳ_{x_i})}, where Ȳ_{x_i} denotes a subset of all correct outputs Y_{x_i} associated with input x_i. The goal of 1oML is to learn a function f such that, for any input x, f(x) = y for some y ∈ Y_x. We show that a naïve strategy that uses a separate loss term for each (x_i, y_ij) pair, where y_ij ∈ Ȳ_{x_i}, can result in a bad likelihood objective. Next, we introduce a multiplicity-aware loss (CC-LOSS) and demonstrate its limitations for non-autoregressive models on structured output spaces. In response, we present our first-cut approach, MINLOSS, which picks the single y_ij closest to the prediction ŷ_i based on the current parameters of the prediction network (the base architecture for function f), and uses it to compute and back-propagate the loss for that training sample x_i.
Though significantly better than naïve training, we demonstrate through a simple example that MINLOSS can be sub-optimal in certain scenarios, due to its inability to pick a y_ij based on global characteristics of the solution space. To alleviate these issues, we present two exploration-based techniques, I-EXPLR and SELECTR, that select a y_ij in a non-greedy fashion, unlike MINLOSS. Both techniques are generic in the sense that they can work with any prediction network for the given problem. I-EXPLR relies on the prediction network itself for selecting y_ij, whereas SELECTR is an RL-based learning framework that uses a selection module to decide which y_ij should be picked for a given input x_i for back-propagating the loss in the next iteration. SELECTR's selection module is trained jointly with the prediction network using reinforcement learning, allowing us to trade off exploration and exploitation in selecting the optimum y_ij by learning a probability distribution over the space of possible y_ij's for any given input x_i. We experiment on three CSPs: N-Queens, Futoshiki, and Sudoku. Our prediction networks for the first two problems are constructed using Neural Logic Machines (Dong et al., 2019), and for Sudoku, we use a state-of-the-art neural solver based on Recurrent Relational Networks (Palm et al., 2018). In all three problems, our experiments demonstrate that SELECTR vastly outperforms naïve baselines by up to 21 points, underscoring the value of explicitly modeling solution multiplicity. SELECTR also consistently improves on the other multiplicity-aware methods, viz. CC-LOSS, MINLOSS, and I-EXPLR.

2. BACKGROUND AND RELATED WORK

Related ML Models: There are a few learning scenarios within weak supervision that may appear similar to the setting of 1oML but are actually different from it. We first discuss them briefly. 'Partial Label Learning' (PLL) (Jin & Ghahramani, 2002; Cour et al., 2011; Xu et al., 2019; Feng & An, 2019; Cabannes et al., 2020) involves learning from training data where, for each input, a noisy set of candidate labels is given, amongst which only one label is correct. This is different from 1oML, in which there is no training noise and all the solutions in the solution set Y_x for a given x are correct. Though some of the recent approaches to tackling ambiguity in PLL (Cabannes et al., 2020) may resemble one of our methods, MINLOSS, in the way they decide which solution in the target set should be picked next for training, the motivations are quite different. Similarly, in the older work by Jin & Ghahramani (2002), the EM model, in which the loss for each candidate is weighted by the probability assigned to that candidate by the model itself, can be seen as a naïve exploration-based approach applied to a very different setting. In PLL, the objective is to select the correct label out of many incorrect ones to reduce training noise, whereas in 1oML, selecting only one label for training provably improves learnability, and there is no question of reducing noise, as all the labels are correct. Further, most previous work on PLL considers classification over a discrete output space with, say, L labels, whereas in 1oML, we work with structured output spaces, e.g., an r-dimensional vector space where each dimension represents a discrete space of L labels. This exponentially increases the size of the output space, making it intractable to enumerate all possible solutions, as is typically done in existing approaches for PLL (Jin & Ghahramani, 2002).
Within weak supervision, work on the 'Multi-Instance Learning' (MIL) approach for Relation Extraction (RE) employs a selection module to pick a set of sentences to be used for training a relation classifier, given a set of noisy relation labels (Feng et al., 2018; Qin et al., 2018). This is different from our setting, where multiplicity is associated with any given input, not with a class (relation). Beyond weak supervision, 1oML should also not be confused with problems in the space of multi-label learning (Tsoumakas & Katakis, 2007). In multi-label learning, given a solution set Y_x for each input x, the goal is to correctly predict every solution in the set Y_x for x. Typically, a classifier is learned for each of the possible labels separately. On the other hand, in 1oML, the objective is to learn any one of the correct solutions for a given input, and a single classifier is learned. The characteristics of the two problems are quite different, and hence, so are the solution approaches. As we show later, the two settings lead to requirements for different kinds of generalization losses.

Solution Multiplicity in Other Settings: There is some prior work related to our problem of solution multiplicity, albeit in different settings. An example is the task of video prediction, where there can be multiple next frames (y_ij) for a given partial video x_i (Henaff et al., 2017; Denton & Fergus, 2018). The multiplicity of solutions here arises from underlying uncertainty rather than as an inherent characteristic of the domain itself. Current approaches model the final prediction as a combination of a deterministic part oblivious to uncertainty and a non-deterministic part caused by uncertainty. There is no such separation in our case, since each solution is inherently different from the others. Another line of work that comes close to ours is the task of Neural Program Synthesis (Devlin et al., 2017; Bunel et al., 2018). Given a set of Input-Output (IO) pairs, the goal is to generate a valid program conforming to the IO specifications. For a given IO pair, there could be multiple valid programs, and often, training data may only have one (or a few) of them. Bunel et al. (2018) propose a solution in which they define an alternate RL-based loss, using the correctness of the generated program on a subset of held-out IO pairs as the reward. In our setting, in the absence of the constraints (or rules) of the CSP, there is no such additional training signal available outside the subset of targets Ȳ_x for an input x. It is also worth mentioning other tasks, such as Neural Machine Translation (Bahdanau et al., 2015; Sutskever et al., 2014), Summarization (Nallapati et al., 2017; Paulus et al., 2018), and Image Captioning (Vinyals et al., 2017; You et al., 2016), where one would expect multiple valid solutions for any given input. E.g., for a given sentence in language A, there could be multiple valid translations in language B.
To the best of our knowledge, existing literature ignores solution multiplicity in such problems and simply trains on all given labels for any given input.

Models for Symbolic Reasoning: Our work follows the line of recent research that proposes neural architectures for implicit symbolic and relational reasoning problems (Santoro et al., 2018; Palm et al., 2018; Wang et al., 2019; Dong et al., 2019). We experiment with two architectures as base prediction networks: Neural Logic Machines (NLMs) (Dong et al., 2019), and Recurrent Relational Networks (RRNs) (Palm et al., 2018). NLMs allow learning of first-order logic rules expressed as Horn clauses over a set of predicates, making them amenable to transfer across different domain sizes. The rules are instantiated over a given set of objects, whose groundings are represented as tensors in the neural space over which the logical rules operate. RRNs use a graph neural network to learn relationships between symbols represented as nodes in the graph, and have been shown to perform well on problems that require multiple steps of symbolic reasoning.

3.1. PROBLEM DEFINITION

Notation: Each possible solution (target) for an input (query) x is denoted by an r-dimensional vector y ∈ V^r, where each element of y takes values from a discrete space denoted by V. Let Y = V^r, and let Y_x denote the set of all solutions associated with input x. We use the term solution multiplicity to refer to the fact that there could be multiple possible solutions y for a given input x. In our setting, the solutions in Y_x span a structured combinatorial subspace of V^r, and can be thought of as representing solutions to an underlying Constraint Satisfaction Problem (CSP). For example, in N-Queens, x would denote a partially filled board, and y a solution for the input board. Given a set of inputs x_i along with a subset of associated solutions Ȳ_{x_i} ⊆ Y_{x_i}, i.e., given a set of (x_i, Ȳ_{x_i}) pairs, we are interested in learning a mapping from x to any one y among the many possible solutions for x. Formally, we define the One-of-Many-Learning (1oML) problem as follows.

Definition 1. Given training data D of the form {(x_i, Ȳ_{x_i})}_{i=1}^{m}, where Ȳ_{x_i} denotes a subset of the solutions associated with input x_i, and m is the size of the training dataset, One-of-Many-Learning (1oML) is defined as the problem of learning a function f such that, for any input x, f(x) = y for some y ∈ Y_x, where Y_x is the set of all solutions associated with x.

We use parameterized neural networks to represent our mapping function. We use M_Θ to denote a non-autoregressive network M with an associated set of parameters Θ. We use ŷ_i (ŷ) to denote the network output corresponding to input x_i (x), i.e., ŷ_i (ŷ) is the argmax of the learnt conditional distribution over the output space Y given the input x_i (x). We are interested in finding a Θ* that solves the 1oML problem as defined above. Next, we consider various formulations for the same.
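Concretely, a 1oML instance pairs each query with a set of valid targets, and a prediction is correct iff it lies in the full solution set. The toy sketch below (an illustrative encoding of ours, not the paper's code) makes the success criterion f(x) = y for some y ∈ Y_x explicit:

```python
# Toy 1oML success criterion: a prediction f(x) is correct iff it
# matches *any* element of the full solution set Y_x.

def is_correct(prediction, solution_set):
    """Return True iff prediction equals some y in the solution set."""
    return tuple(prediction) in {tuple(y) for y in solution_set}

# e.g., an output space V^r with V = {0, 1} and r = 2, and a query
# whose solution set is {(0, 1), (1, 0)} (as in Example 1 below)
Y_x = [(0, 1), (1, 0)]
ok = is_correct((1, 0), Y_x)    # any one solution suffices
bad = is_correct((1, 1), Y_x)   # not a valid solution
```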

3.2. OBJECTIVE FUNCTION

Naïve Objective: In the absence of solution multiplicity, i.e., when the target set Ȳ_{x_i} = {y_i}, ∀i, the standard method to train such models is to minimize the total loss, L(Θ) = Σ_{i=1}^{m} l_Θ(ŷ_i, y_i), where l_Θ(ŷ_i, y_i) is the loss between the prediction ŷ_i and the unique target y_i for the input x_i. We find the optimal Θ* as argmin_Θ L(Θ). A naïve extension of this for 1oML would be to sum the loss over all given targets in Ȳ_{x_i}, i.e., minimize the following loss function:

L(Θ) = (1/m) Σ_{i=1}^{m} Σ_{y_ij ∈ Ȳ_{x_i}} l_Θ(ŷ_i, y_ij)    (1)

We observe that the loss function in eq. (1) would unnecessarily penalize the model when dealing with solution multiplicity. Even when it is correctly predicting one of the targets for an input x_i, the loss with respect to the other targets in Ȳ_{x_i} could be rather high, hence misguiding the training process. Example 1 below demonstrates such a case. For illustration, we use the cross-entropy loss, i.e., l_Θ(ŷ, y) = -Σ_k Σ_l 1{y[k] = v_l} log(P(ŷ[k] = v_l)), where v_l ∈ V varies over the elements of V, and k indexes the r dimensions of the solution space; y[k] denotes the k-th element of y.

Example 1. Consider a learning problem over a discrete (Boolean) input space X = {0, 1} and a Boolean target space in two dimensions, i.e., Y = V^r = {0, 1}^2. Let this be a trivial learning problem where, ∀x, the solution set is Y_x = {(0, 1), (1, 0)}. Then, given a set of examples {x_i, Ȳ_{x_i}}, the naïve objective (with l_Θ as cross entropy) is minimized when P(ŷ_i[k] = 0) = P(ŷ_i[k] = 1) = 0.5, for k ∈ {1, 2}, ∀i, which cannot recover either of the desired solutions: (0, 1) or (1, 0).

The problem arises from the fact that, when dealing with 1oML, the training loss defined in eq. (1) is no longer a consistent predictor of the generalization error, as formalized below.

Lemma 1. The training loss L(Θ) as defined in eq.
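The failure in Example 1 can be verified numerically. The sketch below (a toy illustration, not the paper's code) minimizes the naïve cross-entropy summed over both targets for a single two-bit output with independent per-dimension Bernoulli predictions, and confirms the minimizer puts probability 0.5 on each bit, recovering neither (0, 1) nor (1, 0):

```python
import numpy as np

def naive_loss(p):
    """Naïve objective of eq. (1) for one example with targets (0,1), (1,0).

    p[k] = P(y_hat[k] = 1) under a factorized (non-autoregressive) model.
    """
    eps = 1e-12
    ce = lambda p_k, y_k: -(y_k * np.log(p_k + eps)
                            + (1 - y_k) * np.log(1 - p_k + eps))
    loss_01 = ce(p[0], 0) + ce(p[1], 1)   # cross entropy w.r.t. target (0, 1)
    loss_10 = ce(p[0], 1) + ce(p[1], 0)   # cross entropy w.r.t. target (1, 0)
    return loss_01 + loss_10

# grid search over the Bernoulli parameters of both output bits
grid = np.linspace(0.01, 0.99, 99)
best = min((naive_loss(np.array([p0, p1])), p0, p1)
           for p0 in grid for p1 in grid)
# best[1], best[2] both land at 0.5: the degenerate uniform optimum
```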
(1) is an inconsistent estimator of the generalization error for 1oML when l_Θ is a zero-one loss, i.e., l_Θ(ŷ_i, y_ij) = 1{ŷ_i ≠ y_ij}. (Proof in Appendix.)

For the task of PLL, Jin & Ghahramani (2002) propose a modification of the cross-entropy loss to tackle multiplicity of labels in the training data. Instead of adding the log probabilities, it maximizes the log of the total probability over the given target set. Inspired by Feng et al. (2020), we call it CC-LOSS:

L_cc(Θ) = -(1/m) Σ_{i=1}^{m} log Σ_{y_ij ∈ Ȳ_{x_i}} Pr(y_ij | x_i; Θ)

However, in the case of structured prediction, optimizing L_cc requires careful implementation due to its numerical instability (see Appendix). Moreover, for non-autoregressive models, CC-LOSS also suffers from the same issues illustrated in Example 1 for the naïve objective.

New Objective: We now motivate a better objective function based on an unbiased estimator. In general, we would like M_Θ to learn a conditional probability distribution Pr(y | x_i; Θ) over the output space Y such that the entire probability mass is concentrated on the desired solution set Ȳ_{x_i}, i.e., Σ_{y_ij ∈ Ȳ_{x_i}} Pr(y_ij | x_i; Θ) = 1, ∀i. If such a conditional distribution is learnt, then we can easily sample a y_ij ∈ Ȳ_{x_i} from it. CC-LOSS is indeed trying to achieve this. However, ours being a structured output space, it is intractable to represent all possible joint distributions over the solutions in Ȳ_{x_i}, especially for non-autoregressive models. Hence, we instead design a loss function that forces the model to learn a distribution in which the probability mass is concentrated on any one of the targets y_ij ∈ Ȳ_{x_i}. We call such distributions one-hot.
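The numerical instability of L_cc is easiest to avoid in log space: under a factorized (non-autoregressive) model, each target's log-probability is a sum of per-dimension log-probabilities, and the log of the total probability over the target set is a logsumexp of those scores. A minimal per-query sketch (illustrative, not the authors' implementation; the (1/m) average over queries is omitted):

```python
import numpy as np

def cc_loss(log_probs_per_dim, targets):
    """CC-LOSS contribution of a single query under a factorized model.

    log_probs_per_dim: (r, L) array; entry [k, l] is log P(y[k] = v_l).
    targets: list of length-r integer vectors, the given solution set.
    Returns -log sum_j P(y_j | x), computed stably via logsumexp.
    """
    r = log_probs_per_dim.shape[0]
    # log P(y_j | x) = sum_k log P(y[k] = y_j[k])  (factorized model)
    scores = np.array([log_probs_per_dim[np.arange(r), y].sum()
                       for y in targets])
    m = scores.max()                              # logsumexp shift for stability
    return -(m + np.log(np.exp(scores - m).sum()))

# toy check: two targets over a 2-dimensional binary output space
log_p = np.log(np.array([[0.3, 0.7],    # dim 0: P(=0), P(=1)
                         [0.6, 0.4]]))  # dim 1: P(=0), P(=1)
loss = cc_loss(log_p, [np.array([0, 1]), np.array([1, 0])])
# total probability = 0.3*0.4 + 0.7*0.6 = 0.54, so loss = -log(0.54)
```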
To force such one-hot distributions, we introduce |Ȳ_{x_i}| new learnable Boolean parameters, w_i, for each query x_i in the training data, and correspondingly define the following loss function:

L_w(Θ, w) = (1/m) Σ_{i=1}^{m} Σ_{y_ij ∈ Ȳ_{x_i}} w_ij l_Θ(ŷ_i, y_ij)    (2)

Here, w_ij ∈ {0, 1} and Σ_j w_ij = 1, ∀i, where j indexes the solutions y_ij ∈ Ȳ_{x_i}. The last constraint over the Boolean variables w_ij enforces that exactly one of the weights in w_i is 1 and all others are zero.

Lemma 2. Under the assumption Ȳ_{x_i} = Y_{x_i}, ∀i, the loss L*(Θ) = min_w L_w(Θ, w), defined as the minimum value of L_w(Θ, w) (defined in eq. (2)) with respect to w, is a consistent estimator of the generalization error for 1oML when l_Θ is a zero-one loss, i.e., l_Θ(ŷ_i, y_ij) = 1{ŷ_i ≠ y_ij}. We refer to the Appendix for details.

Next, we define our new objective as:

min_{Θ,w} L_w(Θ, w)
s.t. w_ij ∈ {0, 1}, ∀i, ∀j;  Σ_{j=1}^{|Ȳ_{x_i}|} w_ij = 1, ∀i = 1…m    (3)

3.3. GREEDY FORMULATION: MINLOSS

In this section, we present one possible way to optimize our desired objective min_{Θ,w} L_w(Θ, w). It alternates between optimizing over the Θ parameters and optimizing over the w parameters. While the Θ parameters are optimized using SGD, the weights w are selected greedily for a given Θ = Θ^(t) at each iteration, i.e., a non-zero weight is assigned to the solution with the minimum loss amongst all possible y_ij ∈ Ȳ_{x_i}, for each i = 1…m:

w_ij^(t) = 1{ y_ij = argmin_{y ∈ Ȳ_{x_i}} l_{Θ^(t)}(ŷ_i^(t), y) },  ∀i = 1…m    (4)

This can be done by computing the loss with respect to each target and picking the one with the minimum loss. We refer to this approach as MINLOSS. Intuitively, for a given set of Θ^(t) parameters, MINLOSS greedily picks the weight vectors w_i^(t), and uses them to get the next set of Θ^(t+1) parameters using the SGD update:

Θ^(t+1) ← Θ^(t) - α_Θ ∇_Θ L_w(Θ, w) |_{Θ=Θ^(t), w=w^(t)}    (5)

One significant challenge with MINLOSS is the fact that it chooses the current set of w parameters independently for each example, based on the current Θ values. While this way of picking the w parameters is optimal if Θ has reached the optimum, i.e., Θ = Θ*, it can lead to sub-optimal choices when Θ and w are being trained simultaneously. The following example illustrates this.

[Figure 1: one-dimensional input space of Example 2, showing the training points at x = -2, x = -1, and x = 1.]

Example 2. Consider a simple task with a one-dimensional continuous input space X ⊂ R and target space Y = {0, 1}. Consider learning with 10 examples, given as (x = 1, Y_x = {1}) (5 examples), (x = -1, Y_x = {0, 1}) (4 examples), and (x = -2, Y_x = {1}) (1 example). The optimal decision hypothesis is given as y = 1{x > α} for α < -2, or y = 1{x < β} for β > 1. Assume we learn this with logistic regression, using MINLOSS as the training algorithm to optimize the objective in eq. (3).
If we initialize the parameters of the logistic model such that the starting hypothesis is y = 1{x > 0} (logistic parameters: θ_1 = 0.1, θ_0 = 0), MINLOSS will repeatedly pick the target y = 0 for the samples with x = -1. This results in the learning algorithm converging to the decision hypothesis y = 1{x > -0.55}, which is sub-optimal, since the input with x = -2 is incorrectly classified (Fig. 1; see Appendix for a detailed discussion). MINLOSS fails to reach the optimum because it greedily picks the target for each query x_i based on the current set of parameters and gets stuck in a local minimum. This is addressed in the next section.
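The greedy w selection of eq. (4) reduces to an argmin over per-target losses before each gradient step. A schematic sketch (hypothetical data layout, not the paper's code), given an already-computed matrix of per-target losses:

```python
import numpy as np

def minloss_weights(per_target_losses):
    """Greedy weight selection of eq. (4), i.e., MINLOSS.

    per_target_losses: list over examples; per_target_losses[i][j] is
    l_Theta(y_hat_i, y_ij) for the j-th target of example i (target sets
    may have different sizes).
    Returns one-hot weight vectors w_i selecting the minimum-loss target;
    only the selected target's loss is then back-propagated (eq. (5)).
    """
    weights = []
    for losses in per_target_losses:
        w = np.zeros(len(losses))
        w[int(np.argmin(losses))] = 1.0   # exactly one w_ij = 1 per example
        weights.append(w)
    return weights

w = minloss_weights([[0.9, 0.2, 0.4],   # example 0: picks target j = 1
                     [1.3, 2.0]])       # example 1: picks target j = 0
```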

3.4. REINFORCEMENT LEARNING FORMULATION: SELECTR

In this section, we design a training algorithm that fixes some of the issues observed with MINLOSS. Considering Example 2 above, the main problem with MINLOSS is its inability to consider alternate targets that may not be greedily optimal at the current set of parameters. A better strategy would explore alternative solutions as a way of reaching better optima; e.g., in Example 2, we could pick, for the input x = -1, the target y = 1 with some non-zero probability, to come out of the local optimum. In this case, that also happens to be the globally optimal strategy. This is the key motivation for our RL-based strategy proposed below. A natural question arises: how should we assign the probability of picking a particular target? A naïve approach would use the probability assigned by the underlying M_Θ network to decide the amount of exploration on each target y. We call this I-EXPLR. We argue below why this may not always be an optimal choice: the amount of exploration required may depend in complex ways on the global solution landscape, as well as on the current set of parameters. Therefore, we propose a strategy that makes use of a separate selection module (a neural network), which takes as input the current example (x_i, Ȳ_{x_i}) and outputs the probability of picking each target for training Θ in the next iteration. Our strategy is RL-based, since we can think of choosing each target (for a given input) as an action that our selection module needs to take. The selection module is trained using a reward that captures the quality of selecting the corresponding target for training the prediction network. We next describe its details.

Selection Module (S_φ): This is an RL agent, or policy network, whose action is to select a target y_ij ∈ Ȳ_{x_i} for each x_i. Given a training sample (x_i, Ȳ_{x_i}), it first internally predicts ŷ_i' = M_{Θ_}(x_i), using a past copy Θ_ of the parameters.
This prediction is then fed as input, along with the target set Ȳ_{x_i}, to a latent model G_φ, which outputs a probability distribution Pr_φ(y_ij), ∀y_ij ∈ Ȳ_{x_i}, s.t. Σ_{y_ij} Pr_φ(y_ij) = 1. S_φ then picks a target ȳ_i ∈ Ȳ_{x_i} according to the distribution Pr_φ(y_ij) and returns a weight vector w̃_i such that w̃_ij = 1 if y_ij = ȳ_i, and w̃_ij = 0 otherwise.
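One forward pass of the selection module can be sketched as follows. This is a deliberately tiny stand-in: the real G_φ is a neural network, whereas here the target scores are a hypothetical linear function of agreement features between each candidate target and the frozen prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

def selection_forward(y_hat_old, targets, phi):
    """Toy forward pass of a selection module S_phi.

    Scores each candidate target against the frozen prediction y_hat_old
    (here: a linear score phi on per-position agreement; illustrative only),
    softmaxes the scores into Pr_phi, samples y_bar from it, and returns
    the distribution and the one-hot weight vector w_tilde.
    """
    feats = np.array([(y_hat_old == y).astype(float) for y in targets])
    scores = feats @ phi
    scores -= scores.max()                      # numerically stable softmax
    probs = np.exp(scores) / np.exp(scores).sum()
    j = rng.choice(len(targets), p=probs)       # sample the target y_bar
    w_tilde = np.zeros(len(targets))
    w_tilde[j] = 1.0                            # one-hot, as in the text
    return probs, w_tilde

probs, w_tilde = selection_forward(
    np.array([0, 1, 1]),                        # frozen prediction y_hat'
    [np.array([0, 1, 0]), np.array([1, 1, 1])], # candidate targets
    phi=np.ones(3))
```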

Update of φ Parameters:

The job of the selection module is to pick one target, ȳ_i ∈ Ȳ_{x_i}, for each input x_i, for training the prediction network M_Θ. If we were given an oracle telling us which ȳ_i is most suited for training M_Θ, we would train the selection module S_φ to match the oracle. In the absence of such an oracle, we train S_φ using a reward scheme. Intuitively, ȳ_i is a good choice for training M_Θ if it is "easier" for the model to learn to predict ȳ_i. In our reward design, we measure this degree of ease by the agreement between ȳ_i and M_Θ's prediction ŷ_i (i.e., r minus their Hamming distance):

R(ŷ_i, ȳ_i) = Σ_{k=1}^{r} 1{ŷ_i[k] = ȳ_i[k]}

We note that there are other choices for the reward, e.g., a binary reward that is 1 only if the prediction model M_Θ has learnt to predict the selected target ȳ_i exactly. Our reward scheme is a granular proxy of this binary reward and makes it possible to earn a partial reward even when the binary reward would be 0. The expected reward for RL can then be written as:

R(φ) = Σ_{i=1}^{m} Σ_{y_ij ∈ Ȳ_{x_i}} Pr_φ(y_ij) R(ŷ_i, y_ij)    (6)

We make use of the policy gradient to compute the derivative of the expected reward with respect to the φ parameters. Accordingly, the update equation for φ can be written as:

φ^(t+1) ← φ^(t) + α_φ ∇_φ R(φ) |_{φ=φ^(t)}    (7)

Update of Θ Parameters: The next step is to use the output of the selection module, w̃_i, corresponding to the sampled target ȳ_i, ∀i, to train the M_Θ network. The update equation for the Θ parameters in the next learning iteration is:

Θ^(t+1) ← Θ^(t) - α_Θ ∇_Θ L_w(Θ, w) |_{Θ=Θ^(t), w=w̃^(t)}    (8)

Instead of backpropagating the loss gradient at a sampled target ȳ_i, one could also backpropagate the gradient of the expected loss under the distribution Pr_φ(y_ij). In our experiments, we backpropagate through the expected loss, since the action space for the selection module S_φ is tractable. Figure 2 represents the overall framework.
In the diagram, gradients for updating Θ flow back through the red line and gradients for updating φ flow back through the green line.
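Because the action space (the target set) is small, the expected reward in eq. (6) is differentiable in closed form: for a softmax policy with logits s, the gradient is ∂E[R]/∂s_j = p_j (R_j - E[R]). A sketch with a numerical gradient check (toy values; the real gradient flows through G_φ's parameters via the chain rule):

```python
import numpy as np

def expected_reward(logits, rewards):
    """E[R] under a softmax policy over a small target set, as in eq. (6)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p @ rewards

def policy_grad(logits, rewards):
    """Closed-form d E[R] / d s_j = p_j * (R_j - E[R]) for softmax logits s."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p * (rewards - p @ rewards)

s = np.array([0.5, -1.0, 2.0])       # toy logits over 3 candidate targets
R = np.array([3.0, 1.0, 2.0])        # toy per-target rewards (e.g., agreements)
g = policy_grad(s, R)

# central-difference check of the analytic gradient
eps = 1e-6
g_num = np.array([
    (expected_reward(s + eps * e, R) - expected_reward(s - eps * e, R)) / (2 * eps)
    for e in np.eye(3)])
```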

3.5. TRAINING ALGORITHM

We put all the update equations together and describe the key components of our training algorithm below. Algorithm 1 presents detailed pseudocode.

Algorithm 1: Joint Training of Prediction Network M_Θ and Selection Module S_φ

    Θ^(0) ← pre-train Θ using eq. (4) and eq. (5)
    In Selection Module (SM): Θ_ ← Θ^(0)
    φ^(0) ← pre-train φ using rewards from M_Θ in eq. (7)
    Initialize: t ← 0
    while not converged do
        B ← randomly fetch a mini-batch
        for i ∈ B do
            Get weights: w̃_i ← S_φ((x_i, Ȳ_{x_i}), Θ_)
            Get model predictions: ŷ_i ← M_{Θ^(t)}(x_i)
            Get rewards: r_i ← [R(ŷ_i, y_ij), ∀y_ij ∈ Ȳ_{x_i}]
        end
        Update φ: use eq. (7) to get φ^(t+1)
        Update Θ: use eq. (8) to get Θ^(t+1)
        Update Θ_ ← Θ^(t+1) if t mod copyitr = 0 (in SM)
        Increment t ← t + 1
    end

Pre-training: It is a common strategy in many RL-based approaches to first pre-train the network weights using a simple strategy. Accordingly, we pre-train both the M_Θ and S_φ networks before going into joint training. First, we pre-train M_Θ. In our experiments, we observe that in some cases, pre-training M_Θ using only those samples from the training data D that have a unique solution, i.e., {(x_i, Ȳ_{x_i}) ∈ D s.t. |Ȳ_{x_i}| = 1}, gives better performance than pre-training with MINLOSS. Therefore, we pre-train using both approaches and select the better one based on performance on a held-out dev set. Once the prediction network is pre-trained, a copy of it is given to the selection module to initialize M_{Θ_}. Keeping Θ and Θ_ fixed and identical to each other, the latent model G_φ in the selection module is pre-trained using the rewards given by the pre-trained M_Θ and the internal predictions given by M_{Θ_}.

Joint Training: After pre-training, the prediction network M_Θ and the selection module S_φ are trained jointly. In each iteration t, the selection module first computes the weights w̃_i^(t) for each sample in the mini-batch. The prediction network computes the prediction ŷ_i^(t) and the rewards R(ŷ_i^(t), y_ij), ∀y_ij ∈ Ȳ_{x_i}.
The parameters φ^(t) and Θ^(t) are updated simultaneously using eq. (7) and eq. (8), respectively. The copy of the prediction network within the selection module, i.e., M_{Θ_} in S_φ, is updated with the latest parameters Θ^(t) after every copyitr updates, where copyitr is a hyper-parameter.

4. EXPERIMENTS

The main goal of our experiments is to evaluate the four multiplicity-aware methods, CC-LOSS, MINLOSS, informed exploration (I-EXPLR), and RL-based exploration (SELECTR), against baseline approaches that completely disregard the problem of solution multiplicity. We also wish to assess the performance gap, if any, between queries with a unique solution and those with many possible solutions. To answer these questions, we conduct experiments on three different tasks (N-Queens, Futoshiki, and Sudoku), trained over two different prediction networks, as described below.

4.1. DATASETS AND PREDICTION NETWORKS

N-Queens: Given a query, i.e., a chess board of size N × N with a placement of k < N non-attacking queens on it, the task of N-Queens is to place the remaining N - k queens such that no two queens attack each other. We train a Neural Logic Machine (NLM) model (Dong et al., 2019) as the prediction network M_Θ for this task. To model N-Queens within NLM, we represent a query x and the target y as N²-dimensional Boolean vectors with 1 at the locations where a queen is placed. We use another, smaller NLM architecture as the latent model G_φ. We train our model on 10-Queens puzzles and test on 11-Queens puzzles, both with 5 placed queens. This size-invariance between training and test is a key strength of the NLM architecture, which we exploit in our experiments. To generate the training data, we start with all possible valid 10-Queens board configurations, randomly mask any 5 queens, and then compute all valid completions to generate potentially multiple solutions for an input. Test data is generated similarly. Training and testing on different board sizes ensures that no direct information leaks from test to train. Queries with multiple solutions have 2 to 6 solutions, so we choose Ȳ_{x_i} = Y_{x_i}, ∀x_i.

Futoshiki: This is a logic puzzle in which we are given a grid of size N × N, and the goal is to fill the grid with digits from {1…N} such that no digit is repeated in a row or a column. k out of the N² positions are already filled in the input query x, and the remaining N² - k positions need to be filled. Further, inequality constraints are specified between some pairs of adjacent grid positions, which must be honored in the solution. Our prediction network and latent model use NLM, and the details (described in the Appendix) are very similar to those for N-Queens. As for N-Queens, we do size-invariant training: we train our models on 5 × 5 puzzles with 14 missing digits and test on 6 × 6 puzzles with 20 missing digits.
As for N-Queens, we generate all possible valid grids and randomly mask out the requisite number of digits to generate the train and test data. For both train and test queries, we keep up to five inequality constraints of each type: > and <.

Sudoku: We also experiment on Sudoku, which has been the task of choice for many recent neural reasoning works (Palm et al., 2018; Wang et al., 2019). We use Recurrent Relational Networks (RRNs) (Palm et al., 2018) as the prediction network, since they have recently shown state-of-the-art performance on the task. We use a 5-layer CNN as our latent model G_φ. Existing Sudoku datasets (Royle, 2014; Park, 2018) do not expose the issues with solution multiplicity. In response, we generate our own dataset, starting with a collection of Sudoku puzzles with unique solutions that have 17 digits filled. We remove one of the digits, thus generating a puzzle that is guaranteed to have solution multiplicity. We then randomly add 1 to 18 of the digits back from the solution of the original puzzle, while ensuring that the query continues to have more than one solution. This generates our set of multi-solution queries, with a uniform distribution of filled digits from 17 to 34. We mix in an equal number of unique-solution queries (with the same distribution of givens). Because some x_i's may have hundreds of solutions, we randomly sample 5 of them for Ȳ_{x_i}, i.e., |Ȳ_{x_i}| ≤ 5 in the train set.

For each dataset, we generate a devset in a manner similar to the test set. We separately report performance on two mutually exclusive subsets of the test data: OS, queries with a unique solution, and MS, those with multiple solutions. For all methods, we tune hyperparameters (and do early stopping) based on devset performance. Additional hyperparameters for the four multiplicity-aware methods include the ratio of OS and MS examples in training. I-EXPLR and SELECTR also select the pre-training strategy as described in Section 3.5.
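The mask-and-complete data generation described for N-Queens can be sketched at toy scale. The code below (illustrative only; the paper uses 10 × 10 boards with 5 masked queens, whereas this uses 6 × 6 with 4 masked) enumerates all solutions, masks queens from one solution, and collects every valid completion as the multi-solution target set:

```python
def queens_solutions(n):
    """All n-queens solutions via backtracking; sol[col] = row of the queen."""
    sols = []
    def place(rows):
        c = len(rows)
        if c == n:
            sols.append(tuple(rows))
            return
        for r in range(n):
            # no shared row, no shared diagonal with any earlier column
            if all(r != r2 and abs(r - r2) != c - c2
                   for c2, r2 in enumerate(rows)):
                place(rows + [r])
    place([])
    return sols

def make_query(solution, keep_cols):
    """Mask the queens outside keep_cols; None marks an empty column."""
    return tuple(r if c in keep_cols else None
                 for c, r in enumerate(solution))

def completions(query, all_sols):
    """All full solutions consistent with the partial board (the set Y_x)."""
    return [s for s in all_sols
            if all(q is None or q == r for q, r in zip(query, s))]

all_sols = queens_solutions(6)               # 6-queens has exactly 4 solutions
query = make_query(all_sols[0], keep_cols={0, 1})
Y_x = completions(query, all_sols)           # possibly multiple valid targets
```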
For all tasks, we consider the output of a prediction network correct only if it is a valid solution of the underlying CSP; no partial credit is given for guessing parts of the output correctly.
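This validity check follows directly from the problem constraints. Below is a minimal sketch for N-Queens (our own illustration, not the paper's evaluation code; `is_valid_nqueens` is a hypothetical helper), using the flat N^2 Boolean board encoding described above:

```python
def is_valid_nqueens(board, N):
    """Check that a flat N*N 0/1 vector encodes a valid N-Queens solution:
    exactly N queens with no two in the same row, column, or diagonal."""
    queens = [(i // N, i % N) for i, v in enumerate(board) if v == 1]
    if len(queens) != N:
        return False
    rows = {r for r, c in queens}
    cols = {c for r, c in queens}
    diags = {r - c for r, c in queens}
    off_diags = {r + c for r, c in queens}
    # every attack line holds at most one queen iff all coordinates are distinct
    return len(rows) == len(cols) == len(diags) == len(off_diags) == N

# A valid 4-Queens board and an invalid one (two queens share column 0)
good, bad = [0] * 16, [0] * 16
for r, c in [(0, 1), (1, 3), (2, 0), (3, 2)]:
    good[r * 4 + c] = 1
for r, c in [(0, 0), (1, 3), (2, 0), (3, 2)]:
    bad[r * 4 + c] = 1
```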

4.3. RESULTS AND DISCUSSION

We report the accuracies across all tasks and models in Table 2. For each setting, we report the mean over three random runs (with different seeds), and also the accuracy of the best of these runs selected via the devset (in parentheses). We first observe that Naïve and Random perform significantly worse than Unique in all tasks, not only on MS but on OS as well. This suggests that 1oML models that explicitly handle solution multiplicity, even if by simply discarding extra solutions, are much better than those that do not recognize it at all. Predictably, all multiplicity-aware methods vastly improve upon the performance of the naïve baselines, with dramatic 13-52 pt gains between Unique and SELECTR on queries with multiple solutions. Comparing MINLOSS and SELECTR, we find that our RL-based approach consistently outperforms MINLOSS, with p-values (computed using McNemar's test for the best models selected based on the validation set) of 1.00e-16, 0.03, and 1.69e-18 for N-Queens, Futoshiki, and Sudoku, respectively (see Appendix for seed-wise comparisons of gains across tasks). On the other hand, the informed exploration technique I-EXPLR, though it improves over MINLOSS on two out of three tasks, performs worse than SELECTR in all domains. This highlights the value of RL-based exploration on top of the greedy target selection of MINLOSS, as well as over the simple exploration of I-EXPLR; we attribute it to the greater exploratory power of SELECTR. See Appendix for more discussion and experiments comparing the two exploration techniques. Recall that the Sudoku training set retains no more than 5 solutions per query, irrespective of the actual number of solutions, i.e., for many x_i the training target set is a strict subset of Y_{x_i}. Despite the incomplete solution sets, we obtain significant improvements over the baselines, indicating that our formulation handles solution multiplicity even with incomplete information.
Furthermore, the large variation in the size of the solution set (|Y_x|) in Sudoku allows us to assess its effect on overall performance. We find that all models get worse as |Y_x| increases (fig. 3), though SELECTR remains the most robust (see Appendix for details).

5. CONCLUSION AND FUTURE WORK

In this paper, we have defined 1oML: the task of learning one of many solutions for combinatorial problems in structured output spaces. We have identified solution multiplicity as an important aspect of the problem which, if not handled properly, may result in sub-optimal models. As a first-cut solution, we proposed a greedy approach: the MINLOSS formulation. We identified certain shortcomings of this greedy approach and proposed two exploration-based formulations: I-EXPLR and an RL formulation, SELECTR, which overcomes some of the issues in MINLOSS by exploring locally sub-optimal choices for better global optimization. Experiments on three different tasks using two different prediction networks demonstrate the effectiveness of our approach in training robust models under solution multiplicity. It is interesting to note that for traditional CSP solvers, e.g. (Selman et al., 1993; Mahajan et al., 2004), a problem with many solutions is considered an easy problem, whereas for neural models such problems appear much harder (Figure 3). As future work, it will be interesting to combine symbolic CSP solvers with SELECTR to design a much stronger neuro-symbolic reasoning model.

Lemma 1. The training loss L(Θ) as defined in eq. (1) is an inconsistent estimator of generalization error for 1oML when l_Θ is a zero-one loss, i.e., l_Θ(ŷ_i, y_ij) = 1{ŷ_i ≠ y_ij}.

Proof. Let D represent the distribution from which samples (x, Y_x) are generated. In our setting, the generalization error ε(M_Θ) for a prediction network M_Θ can be written as: ε(M_Θ) = E_{(x,Y_x)∼D}[1{ŷ ∉ Y_x}], where ŷ = M_Θ(x) is the prediction of the network on an unseen example sampled from the underlying data distribution. Assume a scenario in which, for each input x_i, all the corresponding solutions are present in the training data, i.e., the training target set of x_i equals Y_{x_i}. Then an unbiased estimator ε̂_D(M_Θ) of the generalization error, computed using the training data, is: ε̂_D(M_Θ) = (1/m) Σ_{i=1}^{m} 1{ŷ_i ∉ Y_{x_i}}.
Clearly, the estimator obtained from L(Θ) (the naïve objective), with the loss function l_Θ(ŷ_i, y_ij) replaced by the zero-one loss 1{ŷ_i ≠ y_ij}, is not a consistent estimator of the generalization error. This is easily seen by considering a case where ŷ_i ∈ Y_{x_i} and |Y_{x_i}| > 1: the prediction is correct, yet the naïve loss still penalizes the mismatch with every other target in Y_{x_i}.

OPTIMIZATION ISSUES WITH CC-LOSS

For the task of PLL, Jin & Ghahramani (2002) propose a modification of the cross-entropy loss to tackle multiplicity of labels in the training data. Instead of adding the log probabilities, it maximizes the log of the total probability over the given target set. Inspired by Feng et al. (2020), we call it CC-LOSS:

L_cc(Θ) = -(1/m) Σ_{i=1}^{m} log ( Σ_{y_ij ∈ Y_{x_i}} Pr(y_ij | x_i; Θ) )    (9)

However, in the case of structured prediction, optimizing L_cc suffers from numerical instability. We illustrate this with an example. Consider solving a 9 × 9 Sudoku puzzle x_i. The probability of a particular target board y_ij is a product of r = 9^2 = 81 individual probabilities over the discrete space V = {1 … 9} of size 9, i.e., Pr(y_ij | x_i; Θ) = Π_{k=1}^{r} Pr(y_ij[k] | x_i; Θ). At the beginning of the training process, the network outputs a nearly uniform probability over V for each of the r dimensions, making Pr(y_ij | x_i; Θ) very small (9^{-81} ≈ 5.09e-78). The derivative of the log of such a small quantity is numerically unstable. The naïve loss circumvents this issue by working directly with log probabilities and the log-sum-exp trick. However, CC-LOSS requires summing the probabilities over the target set Y_{x_i} before taking the log, and computing Pr(y_ij | x_i; Θ) itself is numerically unstable. Motivated by the log-sum-exp trick, we use the following modification, which involves computing only log probabilities. For simplicity of notation, we write Pr(y_ij) for Pr(y_ij | x_i; Θ) and L^i_cc for the CC-LOSS of the i-th training sample.
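To make the instability concrete, here is a minimal numerical sketch (our own illustration, not code from the paper): the target probability 9^-81 underflows to zero at float32 precision (the precision typically used on GPUs), while the max-factoring computation derived next stays finite by only ever exponentiating differences of log probabilities.

```python
import math
import struct

# At the start of training, the network is near-uniform over V = {1..9} for
# each of the r = 81 cells, so Pr(y_ij | x_i; Θ) ≈ 9^-81 ≈ 5.09e-78.
r, V = 81, 9
log_probs = [r * math.log(1.0 / V), r * math.log(1.0 / V) - 0.5]  # two targets

# In float32, these probabilities underflow to zero, so summing them and then
# taking the log -- the naive evaluation of CC-LOSS -- breaks down.
as_float32 = lambda x: struct.unpack('f', struct.pack('f', x))[0]
probs_f32 = [as_float32(math.exp(lp)) for lp in log_probs]
assert sum(probs_f32) == 0.0  # log(0) is undefined

# Stable alternative: factor out the max log-probability (log-sum-exp trick),
# so only well-scaled ratios of probabilities are ever exponentiated.
max_lp = max(log_probs)
loss_cc = -(max_lp + math.log(sum(math.exp(lp - max_lp) for lp in log_probs)))
```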
L^i_cc = -log ( Σ_{y_ij ∈ Y_{x_i}} Pr(y_ij) )

Multiply and divide by maxp_i = max_{y_ij ∈ Y_{x_i}} Pr(y_ij):

L^i_cc = -log ( maxp_i · Σ_{y_ij ∈ Y_{x_i}} Pr(y_ij) / maxp_i )

Use the identity α = exp(log(α)):

L^i_cc = -log(maxp_i) - log ( Σ_{y_ij ∈ Y_{x_i}} exp( log( Pr(y_ij) / maxp_i ) ) )
       = -log(maxp_i) - log ( Σ_{y_ij ∈ Y_{x_i}} exp( log Pr(y_ij) - log maxp_i ) )

In the above equations, we first separate out the maximum-probability target (as in the log-sum-exp trick), and then exploit the observation that a ratio of (small) probabilities is numerically more stable than the individual (small) probabilities. Further, we compute this ratio as a difference of individual log probabilities.

Lemma 2. Under the assumption that for each x_i the training data contains all of its solutions Y_{x_i}, the loss L*(Θ) = min_w L_w(Θ, w), defined as the minimum value of L_w(Θ, w) (defined in eq. (2)) with respect to w, is a consistent estimator of generalization error for 1oML when l_Θ is a zero-one loss, i.e., l_Θ(ŷ_i, y_ij) = 1{ŷ_i ≠ y_ij}.

Proof. Let D represent the distribution from which samples (x, Y_x) are generated. In our setting, the generalization error for a prediction network M_Θ is ε(M_Θ) = E_{(x,Y_x)∼D}[1{ŷ ∉ Y_x}], where ŷ = M_Θ(x) is the prediction of the network on an unseen example sampled from the underlying data distribution. As before, with all solutions present in the training data, an unbiased estimator of the generalization error computed using the training data is ε̂_D(M_Θ) = (1/m) Σ_{i=1}^{m} 1{ŷ_i ∉ Y_{x_i}}.

Now, consider the objective function

L*(Θ) = min_w L_w(Θ, w)
      = min_w (1/m) Σ_{i=1}^{m} Σ_{y_ij ∈ Y_{x_i}} w_ij · 1{ŷ_i ≠ y_ij}
      = (1/m) Σ_{i=1}^{m} min_{w_i} Σ_{y_ij ∈ Y_{x_i}} w_ij · 1{ŷ_i ≠ y_ij}
        s.t. w_ij ∈ {0, 1} ∀i, ∀j;  Σ_{j=1}^{|Y_{x_i}|} w_ij = 1, ∀i = 1 … m

For any x_i, if the prediction ŷ_i is correct, i.e., ∃ y_ij* ∈ Y_{x_i} s.t.
ŷ_i = y_ij*, then 1{ŷ_i ≠ y_ij*} = 0 and 1{ŷ_i ≠ y_ij} = 1 for all y_ij ∈ Y_{x_i} with y_ij ≠ y_ij*. Minimizing over w_i then ensures w_ij* = 1 and w_ij = 0 for all y_ij ∈ Y_{x_i}, y_ij ≠ y_ij*. Thus, the contribution to the overall loss from this example x_i is zero. On the other hand, if the prediction is incorrect, then 1{ŷ_i ≠ y_ij} = 1 for all y_ij ∈ Y_{x_i}, making the loss from this example equal to 1 irrespective of the choice of w_i. As a result, L*(Θ) is exactly equal to ε̂_D(M_Θ), and hence it is a consistent estimator of the generalization error.
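The contrast between the two lemmas can be checked numerically. The sketch below is our own illustration (the helper names are hypothetical, and the naïve objective is treated as an average over the listed targets): a prediction that hits one of two valid targets incurs a positive naïve loss but a zero min-over-w loss, matching the 0-1 generalization error.

```python
def naive_zero_one_loss(pred, targets):
    # eq. (1)-style naive objective: average 0-1 loss over every listed target
    return sum(pred != y for y in targets) / len(targets)

def min_w_zero_one_loss(pred, targets):
    # Lemma 2 objective: the one-hot w selects the single best-matching target
    return min(1 if pred != y else 0 for y in targets)

targets = ["solutionA", "solutionB"]  # a query with |Y_xi| = 2 valid solutions
pred = "solutionA"                    # the prediction hits one of them

naive = naive_zero_one_loss(pred, targets)   # 0.5: penalized although correct
min_w = min_w_zero_one_loss(pred, targets)   # 0.0: agrees with 0-1 gen. error
```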

3.3. GREEDY FORMULATION: MINLOSS

Example 2. Consider a simple task with a one-dimensional continuous input space X ⊂ R and target space Y = {0, 1}. Consider learning with 10 examples, given as (x = 1, Y_x = {1}) (5 examples), (x = -1, Y_x = {0, 1}) (4 examples), and (x = -2, Y_x = {1}) (1 example). The optimal decision hypothesis is given by y = 1{x > α} for α ≤ -2, or y = 1{x < β} for β ≥ 1. Assume we learn this with logistic regression using MINLOSS as the training algorithm, optimizing the objective in eq. (3). If we initialize the parameters of the logistic model such that the starting hypothesis is y = 1{x > 0} (logistic parameters θ_1 = 0.1, θ_0 = 0), MINLOSS will greedily pick the target y = 0 for the samples with x = -1, repeatedly. This results in the learning algorithm converging to the decision hypothesis y = 1{x > -0.55}, which is sub-optimal since the input with x = -2 is incorrectly classified (fig. 1; see Appendix for a detailed discussion).

For logistic regression with a one-dimensional input x, the probability of the prediction being 1 at any given point is P(y = 1) = σ(θ_1 x + θ_0), where σ(z) = 1/(1 + e^{-z}), z ∈ R. The decision boundary is the hyperplane on which the probabilities of the two classes, 0 and 1, are equal, i.e., the hyperplane corresponding to P(y = 0) = P(y = 1) = 0.5, or θ_1 x + θ_0 = 0. Initially, θ_1 = 0.1 and θ_0 = 0, so the decision boundary lies at x = 0 (shown in green in fig. 1). All points to the left of the decision boundary are predicted to have label 0, while all points to the right have label 1. For all the dual-label points (x = -1), P(y = 1) < 0.5, so MINLOSS greedily picks the label 0 for these points. This choice does not change unless the decision boundary moves beyond -1. However, we observe that with gradient descent using a sufficiently small learning rate, logistic regression converges with the boundary at x = -0.55, with MINLOSS never flipping its choice.
Clearly, this decision boundary is sub-optimal, since we can define a linear decision boundary (y = 1{x > α} for α ≤ -2, or y = 1{x < β} for β ≥ 1) that assigns label 1 to all the points and achieves 100% accuracy.
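This dynamic is easy to reproduce. The sketch below is our own simulation of Example 2 (not the paper's code): it runs gradient descent on the logistic log-likelihood, re-picking the MINLOSS target for the dual-label points at every step. The boundary settles near x = -0.55 and the point at x = -2 stays misclassified.

```python
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# Example 2's dataset: (input x, set of admissible targets Y_x)
data = [(1.0, {1})] * 5 + [(-1.0, {0, 1})] * 4 + [(-2.0, {1})]

theta1, theta0 = 0.1, 0.0  # initial decision boundary at x = 0
lr = 0.05
for _ in range(20000):
    g1 = g0 = 0.0
    for x, targets in data:
        p = sigmoid(theta1 * x + theta0)  # P(y = 1 | x)
        # MINLOSS: for dual-label points, greedily pick the target with the
        # lowest current loss, i.e. the currently more probable label.
        y = (1 if p >= 0.5 else 0) if len(targets) > 1 else next(iter(targets))
        g1 += (y - p) * x  # gradient of the log-likelihood
        g0 += (y - p)
    theta1 += lr * g1 / len(data)
    theta0 += lr * g0 / len(data)

boundary = -theta0 / theta1  # settles near -0.55; x = -2 stays misclassified
```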

4. EXPERIMENTS

All experiments are repeated thrice using different seeds. Hyperparameters are selected based on held-out validation set performance.

Hardware Architecture: Each experiment is run on a 12GB NVIDIA K40 GPU with 2880 CUDA cores and 4 cores of Intel E5-2680 v3 2.5GHz CPUs.

Optimizer: We use Adam as the optimizer in all our experiments. The initial learning rate is set to 0.005 for the NLM (Dong et al., 2019) experiments and 0.001 for the RRN (Palm et al., 2018) experiments. The learning rate for the RL phase is kept at 0.1 times the initial learning rate. We reduce the learning rate by a factor of 0.2 whenever performance on the dev set plateaus.

4.1. DETAILS FOR N-QUEENS EXPERIMENT

Data Generation: To generate the train data, we start with all possible valid 10-Queens board configurations. We then generate queries by randomly masking any 5 queens. We check for all possible valid completions to generate potentially multiple solutions for any given query. Test data is generated similarly. Training and testing on different board sizes ensures that no direct information leaks from the test dataset to the train dataset. Queries with multiple solutions have a small number of total solutions (2-6), hence we include all solutions of a query in its training target set.

Architecture Details for Prediction Network M_Θ: We use Neural Logic Machines (NLM) (Dong et al., 2019) as the base prediction network for this task. NLM consists of a series of basic blocks, called 'Logic Modules', stacked on top of each other with residual connections. The number of blocks in an NLM architecture is referred to as its depth. Each block takes grounded predicates as input and learns to represent M intermediate predicates as its output. See (Dong et al., 2019) for further details. We chose an architecture with M = 8 and depth = 30. We keep the maximum arity of the intermediate predicates learnt by the network at 2.

Input/Output for Prediction Network: Input to NLM is provided in terms of grounded unary and binary predicates, and the architecture learns to represent an unknown predicate in terms of the input predicates. Each cell on the board acts as an atomic variable over which predicates are defined. Unary Predicates: To indicate the presence of a queen on a cell in the input, we use a unary predicate, 'HasQueenPrior'. It is represented as a Boolean tensor x of size N^2 with 1 on k out of the N^2 cells, indicating the presence of a queen. The output y of the network is also a unary predicate, 'HasQueen', which indicates the final positions of the queens on the board. Binary Predicates: We use 4 binary predicates to indicate whether two cells are in the same row, same column, same diagonal, or same off-diagonal.
The binary predicates are constant for all board configurations of a given size N, and hence can also be thought of as part of the network architecture rather than the input.

Architecture Details for Selection Module S_φ: We use another NLM as the latent model G_φ within the selection module S_φ. We fix depth = 4 and M = 10 for the latent model.

Input/Output for G_φ: Input to G_φ is provided in terms of grounded unary and binary predicates represented as tensors, just like the prediction network. G_φ takes 1 unary predicate as input, represented as an N^2-sized vector, y_ij - ŷ_i′, where ŷ_i′ is the prediction from its internal copy of the prediction network (M_Θ′) given the query x_i. For each y_ij ∈ Y_{x_i}, G_φ returns a score, which is converted into a probability distribution Pr_φ(y_ij) over Y_{x_i} using a softmax layer.
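For concreteness, the input encoding just described can be sketched as follows (`nqueens_predicates` is our hypothetical helper, not code from the released implementation):

```python
from itertools import product

def nqueens_predicates(queens, N):
    """Build the NLM inputs described above: the unary 'HasQueenPrior'
    vector of length N^2 and the four N^2 x N^2 binary predicates."""
    cells = list(product(range(N), range(N)))
    has_queen_prior = [1 if c in queens else 0 for c in cells]

    def pairwise(rel):
        # 1 for every distinct pair of cells satisfying the relation
        return [[1 if a != b and rel(a, b) else 0 for b in cells] for a in cells]

    binary = {
        "SameRow":     pairwise(lambda a, b: a[0] == b[0]),
        "SameCol":     pairwise(lambda a, b: a[1] == b[1]),
        "SameDiag":    pairwise(lambda a, b: a[0] - a[1] == b[0] - b[1]),
        "SameOffDiag": pairwise(lambda a, b: a[0] + a[1] == b[0] + b[1]),
    }
    return has_queen_prior, binary

# A 4 x 4 board with k = 2 queens placed at (0, 0) and (1, 2)
x, rel = nqueens_predicates({(0, 0), (1, 2)}, N=4)
```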

Hyperparameters:

The list below enumerates the various hyper-parameters with a brief description (where required) and the set of values that we experiment with. The best value of a hyper-parameter is selected based on performance on a held-out validation set.
1. Data Sampling: Since queries with multiple solutions are under-represented in the training data, we up-sample them and experiment with different ratios of multi-solution queries in the training data. Specifically, we experiment with ratios of 0.5 and 0.25, in addition to the two extremes of selecting queries with only unique or only multiple solutions. Different data sampling may be used during the pre-training and RL fine-tuning phases.
2. Batch Size: We use a batch size of 4, the maximum that can be accommodated in 12GB of GPU memory.
3. copyitr: We experiment with the two extremes of copying the prediction network after every update and copying after every 2500 updates.

Training Time: Pre-training takes 10-12 hours, while RL fine-tuning takes roughly 6-8 hours on the hardware mentioned at the beginning of the section.

4.2. DETAILS FOR FUTOSHIKI EXPERIMENT

Data Generation: We start by generating all possible ways to fill an N × N grid such that no number appears twice in a row or column. To generate a query, we sample any solution and randomly mask out k positions. We also enumerate all the GreaterThan and LessThan relations between adjacent pairs of cells in the chosen solution and randomly add q of these relations to the query. We check for all possible valid completions to generate potentially multiple solutions for any given query. Test data is generated similarly. Training and testing on different board sizes ensures that no direct information leaks from the test dataset to the training data. Queries with multiple solutions have a small number of total solutions (2-6), so we include all solutions of a query in its training target set.

Architecture Details for Prediction Network M_Θ: Same as the N-Queens experiment.

Input/Output for Prediction Network: As in the N-Queens experiment, the input to the network is a set of grounded unary and binary predicates. We define a grid cell together with the digit to be filled in it as an atomic variable. There are N^2 cells in the grid and each cell can take N values, so we have N^3 atomic variables over which the predicates are defined. Unary Predicates: To indicate the presence of a value in a cell in the input, we use a unary predicate, 'IsPresentPrior'. It is represented as a Boolean tensor x of size N^3 with 1 at k positions, indicating the presence of a digit in a cell. The output y of the network is also a unary predicate, 'IsPresent', which indicates the final prediction for the grid. Additionally, there are two more unary predicates representing the inequality relations that need to be honoured. Since inequality relations are defined only between pairs of adjacent cells, we can represent them using unary predicates. Binary Predicates: We use 3 binary predicates to indicate whether two variables are in the same row, same column, or same grid cell.
The binary predicates are constant for all board configurations of a given size N.

Architecture Details for Selection Module S_φ: Same as the N-Queens experiment.

Input/Output for G_φ: Same as the N-Queens experiment, except for the addition of two more unary predicates corresponding to the inequality relations. The first unary predicate is y_ij - ŷ_i′, which is augmented with the inequality predicates.

Hyperparameters: Same as the N-Queens experiment.

Training Time: Pre-training takes roughly 12-14 hours, while RL fine-tuning takes 7-8 hours.
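The (cell, digit) encoding over N^3 atomic variables can be sketched as follows (`futoshiki_unary` is a hypothetical helper of ours, mirroring the 'IsPresentPrior' description above):

```python
def futoshiki_unary(givens, N):
    """'IsPresentPrior' over the N^3 atomic (cell, digit) variables.

    givens: dict mapping a cell index in 0..N^2-1 to its digit in 1..N."""
    x = [0] * N ** 3
    for cell, digit in givens.items():
        # atomic variable index for (cell, digit)
        x[cell * N + (digit - 1)] = 1
    return x

# A 5 x 5 query with two given digits: a 3 in cell 0 and a 5 in cell 7
x = futoshiki_unary({0: 3, 7: 5}, N=5)
```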

DATA GENERATION FOR SUDOKU

We start with the dataset proposed by Palm et al. (2018). It has 180k queries with only unique solutions, and the number of givens is uniformly distributed in the range from 17 to 34. For the queries with a unique solution, we randomly sample 10,000 queries from their dataset, keeping their train, val and test splits. Using the 17-given queries from the entire dataset of size 180k, we use the following procedure to create queries with multiple solutions. We know that for a Sudoku puzzle to have a unique solution it must have 17 or more givens (McGuire et al., 2012). So we begin with the set of 17-given puzzles having a unique solution and randomly remove one of the givens, giving us a 16-given puzzle which necessarily has more than one correct solution. We then randomly add 1 to 18 of the digits back from the solution of the original puzzle, while ensuring that the query continues to have more than one solution. This procedure gives us multi-solution queries with givens in the range of 17 to 34, matching the original dataset of unique-solution puzzles. We also observed that there are often queries with a very large number of solutions (> 100). We found that such Sudoku queries are often too poorly defined to be of any interest, so we filter out all queries having more than 50 solutions. To obtain the same uniform distribution over the number of givens as in the original unique-solution dataset, we sample queries from this set of multi-solution puzzles such that the number of givens is uniformly distributed in our dataset. We repeat this procedure to generate our validation and test data, starting from the validation and test datasets of Palm et al. (2018).

Architecture Details for Prediction Network M_Θ: We use the Recurrent Relational Network (RRN) (Palm et al., 2018) as the prediction network for this task. RRN uses a message-passing based inference algorithm on graph objects.
We use the same architecture as used by Palm et al. (2018) for their Sudoku experiments. Each cell in the grid is represented as a node in the graph. All the cells in the same row, column and box are connected in the graph. Each inference involves 32 steps of message passing between the nodes in the graph, and the model outputs a prediction at each step.

Input/Output for Prediction Network: Input to the prediction network is represented as an 81 × 10 matrix, with each of the 81 cells represented as a one-hot vector over the digits (0-9, 0 if not given). Output of the prediction network is an 81 × 10 × 32 tensor formed by concatenating the predictions of the network at each of the 32 steps of message passing. The prediction at the last step is used for computing accuracy.

Architecture Details for Selection Module S_φ: We use a CNN as the latent model G_φ. The network consists of four convolutional layers followed by a fully connected layer. The four layers have 100, 64, 32 and 32 filters, respectively. Each filter has a size of 3 × 3 with a stride of 1.

Input/Output for G_φ: As in the other two experiments, the input to G_φ is the output ŷ_i′ from the selection module's internal copy M_Θ′ along with y_ij. Since the prediction network gives an output at each step of message passing, we modify G_φ and the rewards for S_φ accordingly, computing them from the prediction at each step instead of relying only on the final prediction.
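The 81 × 10 one-hot input encoding can be sketched as follows (`encode_sudoku` is our hypothetical helper, not the released RRN preprocessing code):

```python
def encode_sudoku(puzzle):
    """Encode an 81-character Sudoku string ('0' for blank cells) as the
    81 x 10 one-hot input matrix described above."""
    matrix = [[0] * 10 for _ in range(81)]
    for cell, ch in enumerate(puzzle):
        matrix[cell][int(ch)] = 1  # digit 0 marks a cell with no given
    return matrix

onehot = encode_sudoku("0" * 80 + "5")  # toy query with a single given digit
```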

Hyperparameters:

1. Data Sampling: Since queries with multiple solutions and queries with a unique solution are in equal proportion, we no longer need to up-sample multi-solution queries.
2. Batch Size: We use a batch size of 32 for training the baselines, while for RL-based training we use a batch size of 16.
3. copyitr: We experiment with copyitr = 1, i.e. copying M_Θ to M_Θ′ after every update.

The training set used by this model is a super-set of the unique-solution queries in our training data and contains 180,000 queries. This model achieves a high accuracy of 94.32% on queries having a unique solution (OS) in our test data, which is a random sample from their test data only, but the accuracy drops to 24.48% when tested on the subset of our test data containing only queries with multiple solutions (MS). We note that the performance on MS is worse than the Unique baseline, even though both are trained using queries with only a unique solution. This is because the pretrained model overfits on the queries with a unique solution, whereas the Unique baseline early-stops based on performance on a dev set that also contains queries with multiple solutions, hence avoiding overfitting on unique-solution queries.

Training Time: Pre-training the RRN takes around 20-22 hours, whereas RL fine-tuning starting with the pretrained model takes around 10-12 hours.

4.3. RESULTS AND DISCUSSIONS

Table 3 reports the mean test accuracy along with the standard error over three runs for the different baselines and our three approaches. Note that the standard errors reported here are over variations in the choice of random seed, and it is difficult to run a large number of such experiments (with varying seeds) due to their high computational cost. Below, we compare the performance gains for each seed separately.

Seed-wise Comparison for Gains of SELECTR over MINLOSS

In Table 4 we see that SELECTR performs better than MINLOSS for each of the three random seeds independently in all the experiments. We note that starting with the same seed in our implementation leads to identical initialization of the prediction network parameters.

Details of the Analysis Depicted in Figure 3: The large variation in the size of the solution set (|Y_x|) in Sudoku allows us to assess its effect on overall performance. To do so, we divide the test data into bins based on the number of possible solutions for each test input x_i and compare the performance of the best model obtained in the three settings: Unique, MINLOSS and SELECTR. By construction, the number of test points with a unique solution is equal to the total number of test points with more than one solution. Further, while creating the puzzles with more than one solution, we ensured a uniform distribution over the number of filled cells from 17 to 34, as is done in (Palm et al., 2018) for creating puzzles with unique solutions. Hence, the number of points across different bins (representing solution count) may not be the same. Figure 6 shows the average size of each bin and the average number of filled cells for multi-solution queries in a bin. As we move right in the graph (i.e., increase the number of solutions for a given problem), the number of filled cells in the corresponding Sudoku puzzles decreases, resulting in harder problems. This is also demonstrated by the corresponding decrease in performance of all the models in Figure 3. SELECTR is the most robust to this decrease.

Discussion: Why is SELECTR Better than I-EXPLR? In this section, we argue why SELECTR is more powerful than I-EXPLR, even though the reward structure for training the RL agent is such that eventually G_φ will learn to pick the target closest to the current prediction (to maximize reward), and hence S_φ will reduce to I-EXPLR.
We see two reasons why SELECTR is better than I-EXPLR. First, recall that the I-EXPLR strategy gives the model an exploration probability based on its current prediction. But this is only one of the possible exploration strategies; for example, another strategy could be to explore with a fixed epsilon probability. Several other exploration strategies could be equally justified. Instead of hard-coding one, as done for I-EXPLR, our G_φ network gives the ability to learn the best exploration strategy, which may depend in complex ways on the global reward landscape (i.e., simultaneously optimizing reward over all the training examples). Hence we use a neural module for this. Second, note that I-EXPLR is parameter-free and fully dependent on M_Θ, and thus has limited representational power of its own to explore targets. This is not the case with G_φ: its output ȳ and the target y_c closest to the M_Θ prediction ŷ may differ, i.e., ȳ ≠ y_c (see the next paragraph for an experiment on this). When this happens, the gradients encourage a change in Θ so that ŷ moves towards ȳ, and simultaneously encourage a change in φ so that ȳ moves towards y_c. That is, a stable alignment between the two models could be either of the two, y_c or ȳ. This, we believe, increases the overall exploration of the model. Which of y_c or ȳ gets chosen depends on how strongly the global landscape (the other data points) encourages one versus the other. Such flexibility is not available to I-EXPLR, where only the Θ parameters are updated. We believe this flexibility to explore more enables SELECTR to jump off early local optima, achieving better performance than I-EXPLR. We provide preliminary experimental evidence supporting the claim that SELECTR explores more: for every training data point, we check whether the arg max of the G_φ probability distribution (i.e., the highest-probability target ȳ) and y_c differ from each other.
We name such data points "exploratory", and analyze the fraction of exploratory data points as a function of training batches (fig. 7). We observe that over the initial several batches, 3-10% of SELECTR's training data is exploratory. This fraction is, by definition, 0% for I-EXPLR, since it chooses ȳ based on model probabilities. This experiment suggests that SELECTR may indeed explore more early on.
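The exploratory-fraction measurement above can be sketched as follows (our own illustration with hypothetical names; in the actual pipeline the distributions would come from G_φ and y_c from the prediction network):

```python
def exploratory_fraction(batch):
    """Fraction of training points where G_phi's argmax target differs from
    y_c, the target closest to the prediction network's current output.

    batch: list of (g_probs, yc_idx) pairs, where g_probs is the selection
    module's distribution over a query's targets and yc_idx indexes y_c."""
    disagree = sum(
        1 for g_probs, yc_idx in batch
        if max(range(len(g_probs)), key=g_probs.__getitem__) != yc_idx
    )
    return disagree / len(batch)

# Two queries: G_phi agrees with y_c on the first and disagrees on the second
batch = [([0.7, 0.3], 0), ([0.2, 0.8], 0)]
```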



Footnotes:
- Autoregressive models may have the capacity to represent certain classes of non-trivial joint distributions, e.g., Pr(y[1], y[2] | x) could be modeled as Pr(y[1] | x) Pr(y[2] | y[1]; x), but this requires sequential decoding during inference. Studying the impact of solution multiplicity on autoregressive models is beyond the current scope.
- Further details of software environments, hyperparameters and dataset generation are in the appendix.
- Futoshiki and N-Queens training datasets have a significant OS-MS imbalance (see Table 1), necessitating managing this ratio by under-sampling OS. This is similar to the standard approach in class-imbalance problems.
- All the code and datasets are available at: https://sites.google.com/view/yatinnandwani/1oml
- IIT Delhi HPC facility: http://supercomputing.iitd.ac.in
- Log-sum-exp trick: https://blog.feedly.com/tricks-of-the-trade-logsumexp/
- Image source: game play on http://www.brainmetrix.com/8-queens/
- NLM code taken from: https://github.com/google/neural-logic-machines
- Sudoku dataset available at: https://data.dgl.ai/dataset/sudoku-hard.zip
- We identify all solutions to a puzzle using http://www.enjoysudoku.com/JSolve12.zip
- RRN code taken from: https://github.com/dmlc/dgl/tree/master/examples/pytorch/rrn
- Pretrained RRN model available at: https://data.dgl.ai/models/rrn-sudoku.pkl



Figure 1: Decision Boundary learnt by logistic regression guided by MINLOSS. Green line at x = 0 is the initial decision boundary and black vertical line at x = -0.55 is the decision boundary at convergence.

Figure 2: Flow-diagram for our RL Framework

Figure 3: Accuracy vs size of query's solution set (with 95% confidence interval)


Figure 4: Decision Boundary learnt by logistic regression guided by MINLOSS. Green vertical line at x = 0 is the initial decision boundary and black vertical line at x = -0.55 is the decision boundary at convergence.

Figure 5: An 8-Queens query along with its two possible solutions.

4. Weight Decay in Optimizer: We experiment with different weight decay factors of 1E-4, 1E-5 and 0.
5. Pretraining φ: We pretrain G_φ for 250 updates.

4. Weight Decay in Optimizer: We experiment with a weight decay factor of 1E-4 (same as Palm et al. (2018)).
5. Pretraining φ: We pretrain G_φ for 1250 updates, equivalent to one pass over the train data.

Comparison with Pretrained SOTA Model: We also evaluate the performance of a pretrained state-of-the-art neural Sudoku solver (Palm et al., 2018) on our dataset. This model trains and tests on instances with a single solution.

Figure 6: #givens and #datapoints vs size of query's solution set

Table 1: Statistics of the datasets. 'Train', 'Test' and task names are abbreviated. The devset is similar to the test set.

Table 2: Mean (max) test accuracy over three runs for multiplicity-aware methods compared with baselines. OS: test queries with only one solution; MS: queries with more than one solution.

Table 3: Mean test accuracy (± standard error) over three runs for multiplicity-aware methods compared with baselines. OS: test queries with only one solution; MS: queries with more than one solution.
OS: ± 0.84 | 89.19 ± 1.12 | 87.53 ± 0.82 | 88.26 ± 0.88 | 88.25 ± 0.35 | 88.73 ± 0.68 | 88.69 ± 0.55
MS: 09.13 ± 0.89 | 66.39 ± 2.82 | 13.65 ± 1.79 | 76.58 ± 1.63 | 76.93 ± 1.50 | 80.19 ± 1.51 | 81.73 ± 2.00
Overall: 48.49 ± 0.86 | 77.79 ± 1.96 | 50.59 ± 0.49 | 82.42 ± 0.45 | 82.59 ± 0.62 | 84.46 ± 0.69 | 85.21 ± 0.76

Table 4: Seed-wise gains of SELECTR over MINLOSS across different random seeds and experiments.

ACKNOWLEDGEMENT

We thank the IIT Delhi HPC facility for computational resources. We thank the anonymous reviewers for their insightful comments and suggestions, in particular AnonReviewer4 for suggesting a simple yet effective informed exploration strategy (I-EXPLR). Mausam is supported by grants from Google, Bloomberg, 1MG and a Jai Gupta chair fellowship by IIT Delhi. Parag Singla is supported by the DARPA Explainable Artificial Intelligence (XAI) Program with number N66001-17-2-4032. Both Mausam and Parag Singla are supported by Visvesvaraya Young Faculty Fellowships by the Govt. of India and IBM SUR awards. Any opinions, findings, conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views or official policies, either expressed or implied, of the funding agencies.


Published as a conference paper at ICLR 2021 

