CLUSTERING AND ORDERING VARIABLE-SIZED SETS: THE CATALOG PROBLEM

Abstract

Predicting a varying number of ordered clusters from sets of arbitrary cardinality is a challenging task for neural networks, combining elements of set representation, clustering and learning to order. This task arises in many diverse areas, ranging from medical triage, through multi-channel signal analysis for petroleum exploration, to product catalog structure prediction. This paper focuses on the last of these, which exemplifies a number of challenges inherent to adaptive ordered clustering, referred to hereafter as the eponymous Catalog Problem. These include learning variable cluster constraints, exhibiting relational reasoning and managing combinatorial complexity. Despite progress in both neural clustering and set-to-sequence methods, no joint, fully differentiable model exists to date. We develop such a modular architecture, referred to hereafter as Neural Ordered Clusters (NOC), enhance it with a specific mechanism for learning cluster-level cardinality constraints, and provide a robust comparison of its performance against alternative models. We test our method on three datasets, including synthetic catalog structures and PROCAT, a dataset of real-world catalogs consisting of over 1.5 M products, achieving state-of-the-art results on a new, more challenging formulation of the underlying problem that has not been addressed before. Additionally, we examine the network's ability to learn higher-order interactions and investigate its capacity to learn both compositional and structural rulesets.

1. INTRODUCTION

The ability to group members of a set and order these groups is key to many important real-world decision-making processes. It finds applications ranging from supply chain management (Wenzel et al., 2019) to prioritization in medical triage (Miles et al., 2020). Other application domains include petroleum exploration (Rabiller et al., 2010), business process analytics (Le et al., 2014), and product catalog structuring (Jurewicz & Derczynski, 2022), where the goal is to take a set of products and work out how to group them together and order these groups to form a coherent product catalog. We term this problem of simultaneously grouping and ordering a set of items the Catalog Problem.

This paper defines the Catalog Problem and presents an investigation into neural network approaches to it. To this end we introduce a fully differentiable, deep learning (DL) model architecture that addresses the Catalog Problem. In it, sets of items are clustered into groups, and an ordering between groups is established. All of this is achieved in a supervised manner. While clustering methods are often unsupervised (Aljalbout et al., 2018; Ronen et al., 2022), the meaningful ordering of clusters often requires more knowledge than is available from the instance representation alone. Similarly, learning to order is often framed as a supervised learning task (Vinyals et al., 2015; Yin et al., 2020; Shi, 2022). Referred to hereafter as set-to-sequence (S2S), this area and its corresponding methods inspire the cluster-ordering aspect of our proposed Neural Ordered Clusters (NOC) model.

Both neural clustering and set-to-sequence models have limitations.
Element-wise neural clustering methods require O(n) passes over the input set of cardinality n (see footnote 1). Cluster-wise and attention-based models are more computationally efficient, but exhibit a limited ability to learn cluster cardinality constraints (Pakman et al., 2020), which are integral to both the prototypical Catalog Problem and its practical applications.

To address these challenges, we implement a unified clustering and cluster ordering method. NOC is capable of predicting ordered, partitional cluster assignments for elements of sets of varying cardinality. It infers a flexible, input-dependent number k of diverse clusters, maintains O(k) complexity and utilizes a jointly learned representation of set elements to find the target cluster order. Unlike existing neural clustering methods, it exhibits the ability to learn cluster cardinality constraints through supervision. To our knowledge, no other neural method addresses these challenges in an end-to-end, jointly trainable way; existing approaches instead perform clustering and ordering as two separate tasks, sometimes with the separate addition of a representation learning step (Aljalbout et al., 2018). All code, hyper-parameters and datasets required for reproducing our results are made available and detailed in the appendix.

To summarize, our contributions are as follows:

• Firstly, we introduce the Catalog Problem, a novel joint clustering and cluster ordering problem over sets of elements, which is a challenging variant of the set-to-sequence domain with multiple aspects that are not handled by existing neural methods. We exemplify and tackle this problem on three datasets, including a real-world dataset of over 1.5 M products grouped and ordered into product catalogs by human experts.
• Secondly, we propose a novel, fully differentiable, joint neural clustering and cluster ordering model, Neural Ordered Clusters (NOC), capable of predicting an adaptive, input-dependent number of ordered, partitional clusters from sets of varying cardinality.
• Thirdly, we provide a robust comparison of existing and proposed neural methods on the Catalog Problem using synthetic and real-world datasets, providing insights into the models' capacity to learn higher-order relational rules of cluster composition and ordered structure.
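The contrast drawn in this introduction between element-wise clustering (O(n) passes over a set of cardinality n) and cluster-wise clustering (O(k) passes for k clusters) can be illustrated with a toy pass counter. The sketch below is not the NOC model or any published architecture: the function names and the `same_group` predicate are hypothetical stand-ins for a learned network, and a "pass" is simply counted once per decision step.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a learned pairwise decision; not part of any real model.
SameGroup = Callable[[str, str], bool]


def elementwise_clustering(items: List[str], same_group: SameGroup) -> Tuple[List[List[str]], int]:
    """Assign elements one at a time: one notional pass per element, so n passes."""
    clusters: List[List[str]] = []
    passes = 0
    for item in items:
        passes += 1  # one decision step per element
        for cluster in clusters:
            if same_group(cluster[0], item):
                cluster.append(item)  # join an existing cluster
                break
        else:
            clusters.append([item])  # no match found: open a new cluster
    return clusters, passes


def clusterwise_clustering(items: List[str], same_group: SameGroup) -> Tuple[List[List[str]], int]:
    """Form one whole cluster per step: one notional pass per cluster, so k passes."""
    remaining = list(items)
    clusters: List[List[str]] = []
    passes = 0
    while remaining:
        passes += 1  # one decision step per cluster
        anchor, rest = remaining[0], remaining[1:]
        # All remaining elements decide jointly whether they join the anchor.
        clusters.append([anchor] + [x for x in rest if same_group(anchor, x)])
        remaining = [x for x in rest if not same_group(anchor, x)]
    return clusters, passes


# Toy usage: elements belong together when they share a first character.
items = ["a1", "a2", "b1", "a3", "b2", "c1"]
same_letter = lambda u, v: u[0] == v[0]
element_clusters, element_passes = elementwise_clustering(items, same_letter)
cluster_clusters, cluster_passes = clusterwise_clustering(items, same_letter)
print(element_passes, cluster_passes)
```

With six elements in three groups, both procedures recover the same clusters, but the element-wise loop counts six steps and the cluster-wise loop counts three. In a real model the per-step cost also matters; the counter only reflects the number of sequential passes.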

2. THE CATALOG PROBLEM

Many problems require predicting an adaptive, input-dependent number of partitional clusters from sets of varying cardinality and subsequently ordering these clusters according to a target preference. We refer to this as the Catalog Problem. In the Catalog Problem, the input is an unordered set of unique elements. The output is a clustering of these elements, with suitable cluster cardinalities, and an ordering over the clusters (Figure 1). The input may be of any cardinality. Candidate approaches to the problem have to determine how many clusters to create, choose which items to assign to which clusters, and order the clusters. This is a general problem that, as shown by the experiments later in this paper, is non-trivial.
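To make the input and output of the problem concrete, the following minimal sketch represents a candidate solution as a list of clusters (the list order being the cluster order) and checks that it is a valid ordered partition of the input set. The function names and the toy product names are illustrative assumptions, not taken from the paper's code or datasets.

```python
from typing import Dict, List, Sequence, Set


def is_valid_ordered_clustering(elements: Sequence[str], ordered_clusters: List[Set[str]]) -> bool:
    """Return True if ordered_clusters partitions elements into non-empty, disjoint clusters."""
    universe = set(elements)
    if len(universe) != len(elements):
        return False  # the input must consist of unique elements
    seen: Set[str] = set()
    for cluster in ordered_clusters:
        if not cluster:
            return False  # every cluster must be non-empty
        if cluster & seen:
            return False  # partitional: no element may appear in two clusters
        if not cluster <= universe:
            return False  # clusters may only contain input elements
        seen |= cluster
    return seen == universe  # every input element must be assigned


def cluster_positions(ordered_clusters: List[Set[str]]) -> Dict[str, int]:
    """Map each element to the position of its cluster in the predicted order."""
    return {item: pos for pos, cluster in enumerate(ordered_clusters) for item in cluster}


# Toy usage with hypothetical products; the number of clusters is not fixed in advance.
products = ["bread", "apple", "pear", "soap", "shampoo"]
catalog = [{"bread"}, {"apple", "pear"}, {"soap", "shampoo"}]
print(is_valid_ordered_clustering(products, catalog))
print(cluster_positions(catalog)["pear"])
```

The check covers only the structural requirements stated above (unique inputs, partitional clusters, full coverage). Whether the cluster cardinalities and the cluster order are suitable depends on the target preference, which has to be learned from supervision.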



Footnote 1: O(n) passes can be prohibitive with large input sets (n >= 1000), as is often the case in interesting set-input problems such as 3D point cloud tasks (Qi et al., 2017; Ge et al., 2018; Zhao et al., 2021).



Figure 1: The Catalog Problem. From left to right: a set of input elements (X); a clustering of those elements (C); and a target ordering over those clustered elements (y), left to right. The model has to perform all these tasks using information about inter-element relations and intra-cluster relations in order to characterise a cluster, and inter-cluster relations to generate the final, ordered clustering.

