ON REPRESENTING (ANTI)SYMMETRIC FUNCTIONS

Abstract

Permutation-invariant, -equivariant, and -covariant functions and anti-symmetric functions are important in quantum physics, computer vision, and other disciplines. (Anti)symmetric neural networks have recently been developed and applied with great success. A few theoretical approximation results have been proven, but many questions are still open, especially for particles in more than one dimension and the anti-symmetric case, which this work focusses on. More concretely, we derive natural polynomial approximations in the symmetric case, and approximations based on a single generalized Slater determinant in the antisymmetric case. Unlike some previous super-exponential and discontinuous approximations, these seem a more promising basis for future tighter bounds. In the supplementary we also provide a complete and explicit universality proof of the Equivariant MultiLayer Perceptron, which implies universality of symmetric MLPs and the FermiNet.

1. INTRODUCTION

Neural Networks (NN), or more precisely, Multi-Layer Perceptrons (MLP), are universal function approximators [Pin99] in the sense that every (say) continuous function can be approximated arbitrarily well by a sufficiently large NN. The true power of NN though stems from the fact that they apparently have a bias towards functions we care about and that they can be trained by local gradient-descent or variations thereof. For many problems we have additional information about the function, e.g. symmetries under which the function of interest is invariant or covariant. Here we consider functions that are covariant x under permutations. y Of particular interest are functions that are invariant z , equivariant { , or antisymmetric | under permutations. Definition 1 ((Anti)symmetric and equivariant functions) A function φ : X n → R in n ∈ N variables is called symmetric iff φ(x 1 , ..., x n ) = φ(x π(1) , ..., x π(n) ) for all x 1 , ..., x n ∈ X for all permutations π ∈ S n , where S n := {π : {1 : n} → {1 : n} ∧ π is bijection} is called the symmetric group and {1 : n} is short for {1, ..., n}. Similarly, a function ψ : X n → R is called anti-symmetric (AS) iff ψ(x 1 , ..., x n ) = σ(π)ψ(x π(1) , ..., x π(n) ), where σ(π) = ±1 is the parity or sign of permutation π. A function ϕ : X n → X n is called equivariant under permutations iff ϕ(S π (x)) = S π (ϕ(x)), where x ≡ (x 1 , ..., x n ) and S π (x 1 , ..., x n ) := (x π(1) , ..., x π(n) ). Of course (anti)symmetric functions are also just functions, hence a NN of sufficient capacity can also represent (anti)symmetric functions, and if trained on an (anti)symmetric target could converge to an (anti)symmetric function. But NNs that can represent only (anti)symmetric functions are desirable for multiple reasons. Equivariant MLP (EMLP) are the basis for constructing symmetric functions by simply summing the output of the last layer, and for anti-symmetric (AS) functions by x In full generality, a function f : X → Y is covariant under group operations g ∈ G, if f (R X g (x)) = R Y g (f (x)) , where R X g : X → X and R Y g : Y → Y are representations of group (element) g ∈ G. y The symmetric group G = Sn is the group of all permutations=bijections π : {1, ..., n} → {1, ..., n}. z R Y g =Identity. Permutation-invariant functions are also called 'totally symmetric functions' or simply 'symmetric function'. { General Y and X , often Y = X and R Y g = R X g , also called covariant. | R Y g = ±1 for even/odd permutations. multiplying with Vandermonde determinants or by computing their generalized Slater determinant (GSD) defined later. The most prominent application is in quantum physics which represents systems of identical (fermions) bosons with (anti)symmetric wave functions [PSMF20] . Another application is classification of point clouds in computer vision, which should be invariant under permutation of points [ZKR + 18]. Even if a general NN can learn the (anti)symmetry, it will only do so approximately, but some applications require exact (anti)symmetry, for instance in quantum physics to guarantee upper bounds on the true ground state energy [PSMF20] . This has spawned interest in NNs that can represent only (anti)symmetric functions [ZKR + 18, HLL + 19]. A natural question is whether such NNs can represent all reasonable (anti)symmetric functions, which is the focus of this paper. We will answer this question for the (symmetric) EMLP [ZKR + 18] defined in Section 6 and for the (AS) FermiNet [PSMF20] defined in Sections 4&5&6. Approximation architectures need to satisfy a number of criteria to be practically useful: (a) they can approximate a large class of functions, e.g. all continuous (anti)symmetric functions, (b) only the (anti)symmetric functions can be represented, (c) a fast algorithm exists for computing the approximation, (d) the representation itself is continuous or differentiable, (e) the architecture is suitable for learning the function from data (which we don't discuss). Section 2 reviews existing approximation results for (anti)symmetric functions. Section 3 discusses various "naive" representations (linear, sampling, sorting) and their (dis)advantages, before introducing the "standard" solution that satisfies (a)-(e) based on algebraic composition of basis functions, symmetric polynomials, and polarized bases. For simplicity the section considers only totally symmetric functions of their n real-valued inputs (the d = 1 case), i.e. particles in one dimension. Section 4 proves the representation power of a single GSD for totally anti-symmetric (AS) functions (also d = 1). Technically we reduce the GSD to a Vandermonde determinant, and determine the loss of differentiability due to the Vandermonde determinant. From Sections 5 on we consider the general case of functions with n • d inputs that are (anti)symmetric when permuting their n ddimensional input vectors. The case d = 3 is particularly relevant for particles and point clouds in 3D space. The difficulties encountered for d = 1 transfer to d > 1, while the positive results don't, or only with considerable extra effort. The universality construction and proof for the EMLP is outlined in Section 6 with a proper treatment and all details in Sections 6-8 of the supplementary, which implies universality of symmetric MLPs and of the AS FermiNet. Section 7 concludes. We took great care to unify notation from different sources. The list of notation in the appendix should be helpful to disambiguate some similarly looking but different notation. Our main novel contributions are establishing the universality of the anti-symmetric FermiNet with a single GSD (Theorems 3&5&7) for d = 1 and d > 1 (the results are non-trivial and unexpected), and the universality of (2-hidden-layer) symmetric MLPs (Theorem 6) with a complete and explicit and self-contained equivariant universality construction based on (smooth) polynomials. We took care to avoid relying on results with inherently asymptotic or tabulation or discontinuous character, to enable (in future work) good approximation rates for specific function classes, such as smooth functions or those with 'nice' Fourier transform [Bar93, Mak96] , The supplementary material contains the extended version of this paper with (more) details, discussion, and proofs.

2. RELATED WORK

The study of universal approximation properties of NN has a long history, see e.g. [Pin99] for a pre-millennium survey, and e.g. [LSYZ20] [Kol57] (for d = 1) based on elementary symmetric polynomials und using Newton's identities, also known as Girard-Newton or Newton-Girard formulae, which we will generalize to d > 1. Another proof is provided based on homeomorphisms between vectors and ordered vectors, also with no obvious generalization to d > 1. They do not consider AS functions. For symmetric functions and any d ≥ 1, [HLL + 19] provide two proofs of the symmetric superposition theorem of [ZKR + 18]: Every symmetric function can be approximated by symmetric polynomials, symmetrized monomials can be represented as a permanents, and Ryser's formula brings the representation into the desired polarized superposition form. The down-side is that computing permanents is NP complete, and exponentially many symmetrized monomials are needed to approximate f . The second proof discretizes the input space into a n • d-dimensional lattice and uses indicator functions for each grid cell. They then symmetrize the indicator functions, and approximate f by these piecewise constant symmetric indicator functions instead of polynomials, also using Ryser formula for the final representation. Super-exponentially many indicator functions are needed, but explicit error bounds are provided. The construction is discontinuous but they remark on how to make it continuous. Approximating AS f for d ≥ 1 is based on a similar lattice construction, but by summing super-exponentially many Vandermonde determinants, leading to a similar bound. We show that a single Vandermonde/Slater determinant suffices but without bound. Additionally for d = 1 we determine the loss in smoothness this construction suffers from. [SI19] prove tighter but still exponential bounds if f is Lipschitz w.r.t. ∞ based on sorting which inevitably introduces irreparable discontinuities for d > 1. The FermiNet [PSMF20] is also based on EMLPs [ZKR + 18] but anti-symmetrizes not with Vandermonde determinants but with GSDs. It has shown remarkable practical performance for modelling the ground state of a variety of atoms and small molecules. To achieve good performance, a linear combination of GSDs has been used. We show that in principle a single GSD suffices, a sort of generalized Hartree-Fock approximation. This is contrast to the increasing number of conventional Slater determinants required for increasing accuracy. Our result implies (with some caveats) that the improved practical performance of multiple GSDs is due to a limited (approximation and/or learning) capacity of the EMLP, rather than a fundamental limit of the GSD.

3. ONE-DIMENSIONAL SYMMETRY

This section reviews various approaches to representing symmetric functions, and is the broadest review we are aware of. To ease discussion and notation, we consider d = 1 in this section. Most considerations generalize easily to d > 1, some require significant effort, and others break. We discuss various "naive" representations (linear, sampling) and their (dis)advantages, before introducing the "standard" solution that can satisfy (a)-(e). All representations consist of a finite set of fixed (inner) basis functions, which are linearly, algebraically, functionally, or otherwise combined. We then introduce symmetric polynomials, which can be used to prove the "standard" representation theorem for d = 1. The extended version contains a broader and deeper review of alternative representations, including composition by inversion, generally invariant linear bases, symmetric functions by sorting, and linear bases for symmetric polynomials. Indeed it is the broadest review we are aware of, and unified and summarized as far as possible in one big table. The extended review may also help to better grasp the concepts introduced in this section, since it is less dense and contains some illustrating examples. Motivation. Consider n ∈ N one-dimensional particles with coordinates x i ∈ R for particle i = 1, ..., n. In quantum mechanics the probability amplitude of the ground state can be described by a real-valued joint wave function χ(x 1 , ..., x n ). Bosons φ have a totally symmetric wave function: φ(x 1 , ..., x n ) = φ(x π(1) , ..., x π(n) ) for all permutations π ∈ S n ⊂ {1 : n} → {1 : n}. Fermions ψ have totally Anti-Symmetric (AS) wave functions: ψ(x 1 , ..., x n ) = σ(π)ψ(x π(1) , ..., x π(n) ), where σ(π) = ±1 is the parity or sign of permutation π. Wave functions are continuous and almost everywhere differentiable, and often posses higher derivatives or are even analytic. Nothing in this work hinges on any special properties wave functions may possess or interpreting them as such, and the precise conditions required for our results to hold are stated in the theorems. We are interested in representing or approximating all and only such (anti)symmetric functions by neural networks. Abbreviate x ≡ (x 1 , ..., x n ) and let S π (x) := (x π(1) , ..., x π(n) ) be the permuted coordinates. There is an easy way to (anti)symmetrize any function, φ(x) = 1 n! π∈Sn χ(S π (x)), ψ(x) = 1 n! π∈Sn σ(π)χ(S π (x)) and any (anti)symmetric function can be represented in this form (proof: use χ := φ or χ := ψ). If we train a NN χ : R n → R to approximate some function f : R n → R to accuracy ε > 0, then φ (ψ) are (anti)symmetric approximations of f to accuracy ε > 0 too, provided f itself is (anti)symmetric. Instead of averaging, the minimum or maximum or median or many other compositions would also work, but the average has the advantage that smooth χ lead to smooth φ and ψ, and more general, preserves many desirable properties such as (Lipschitz/absolute/...) continuity, (k-times) differentiability, analyticity, etc. It possibly has all important desirable properties, but one: Time complexity, sampling, learning. The problem with this approach is that it has n! terms, and evaluating χ super-exponentially often is intractable even for moderate n, especially if χ is a NN. There can also be no clever trick to linearly (anti)symmetrize arbitrary functions fast, intuitively since the sum pools n! independent regions of χ. In the extended version we prove that computing φ and ψ are indeed NP-hard. There we also show that approximating (1) by sampling permutations is unsuitable, especially for ψ due to sign cancellations. Even if we could compute (1) fast, a NN would represent the function separately on all n! regions, hence potentially requires n! more training samples to learn from than an intrinsically (anti)symmetric NN. See the extended version for a more detailed discussion. Function composition and bases. Before delving into proving universality of the EMLP and the FermiNet, it is instructive to first review the general concepts of function composition and basis functions, since a NN essentially is a composition of basis functions. We want to represent/decompose functions as f (x) = g(β(x)). In this work we are interested in symmetric β, where ultimately β will be represented by the first (couple of) layer(s) of an EMLP, and g by the second (couple of) layer(s). Of particular interest is β(x) = n i=1 η(x i ) for then β and hence f are obviously symmetric (permutation invariant) in x. Anti-symmetry is more difficult and will be dealt with later. Formally let (1 + λx i ) =: 1 + λe 1 (x) + λ 2 e 2 (x) + ... + λ n e n (x) f ∈ F ⊆ R n → R (3) } Functions may be defined on sub-spaces of R k , function composition may not exists, and convergence can be w.r.t. different topologies. We will ignore these technicalities unless important for our results, but the reader may assume compact-open topology, which induces uniform convergence on compacta. are an algebraic basis of all symmetric polynomials. Explicit expressions are e 1 (x) = i x i , and e 2 (x) = i<j x i x j , ..., and e n (x) = x 1 ...x n , and in general e b (x) = i1<...<i b x i1 ...x i b . For given x, the polynomial in λ on the l.h.s. of (3) can be expanded to the r.h.s. in quadratic time or by FFT even in time O(n log n), so the e(x) can be computed in time O(n log n), but is not of the desired form (2). Luckily Newton already solved this problem for us. Newton's identities express the elementary symmetric polynomials e 1 (x), ..., e n (x) as polynomials in p b (x) := n i=1 x b i , b = 1, ..., n, hence also β(x) := (p 1 (x), ..., p n (x)) is an algebraic basis for all symmetric polynomials, hence by closure for all continuous symmetric functions, and is of desired form (2): Theorem 2 (Symmetric polarized superposition [ZKR + 18, WFE + 19, Thm.7]) Every continuous symmetric function φ : R n → R can be represented as φ(x) = g( i η(x i )) with η(x) = (x, x 2 , ..., x n ) and continuous g : R n → R. [ZKR + 18] provide two proofs, one based on 'composition by inversion', the other using symmetric polynomials and Newton's identities. The non-trivial generalization to d > 1 is provided in Section 5. Theorem 2 is a symmetric version of the infamous Kolmogorov-Arnold superposition theorem [Kol57] , which solved Hilbert's 13th problem. Its deep and obscure ~constructions continue to fill whole PhD theses [Liu15, Act18] . It is quite remarkable that the symmetric version above is very natural and comparably easy to prove. For given x, the basis β(x) can be computed in time O(n 2 ), so is actually slower to compute than e(x). The elementary symmetric polynomials also have other advantages (integral coefficients for integral polynomials, works for fields other than R, is numerically more stable, mimics 1,2,3,... particle interactions), so symmetric NN based on e b rather than p b may be worth pursuing. Note that we need at least m ≥ n functional bases for a continuous representation, so Theorem 2 is optimal in this sense [WFE + 19]. The extended version contains a discussion and a table with bases and properties.

4. ONE-DIMENSIONAL ANTISYMMETRY

We now consider the anti-symmetric (AS) case for d = 1. We provide representations of AS functions in terms of generalized Slater determinants (GSD) of partially symmetric functions. In later sections we will discuss how these partially symmetric functions arise from equivariant functions and how to represent equivariant functions by EMLP. The reason for deferral is that EMLP are inherently tied to d > 1. Technically we show that the GSD can be reduced to a Vandermonde determinant, and exhibit a potential loss of differentiability due to the Vandermonde determinant. Let ϕ i : R → R be single-particle wave functions. Consider the matrix Φ(x) =    ϕ 1 (x 1 ) • • • ϕ n (x 1 ) . . . . . . . . . ϕ 1 (x n ) • • • ϕ n (x n )    where x ≡ (x 1 , ..., x n ). The (Slater) determinant det Φ(x) is anti-symmetric, but can represent only a small class of AS functions, essentially the AS analogue of product (wave) functions (pure states, Hartree-Fock approximation). Every continuous AS function can be approximated/represented by a finite/infinite linear combination of such determinants: ψ(x 1 , ..., x n ) = ∞ k=1 det Φ (k) (x), where Φ (k) ij (x) := ϕ (k) i (x j ) An alternative is to generalize the Slater determinant itself [PSMF20] by allowing the functions ϕ i (x j ) to depend on all variables Φ(x) =    ϕ 1 (x 1 |x =1 ) • • • ϕ n (x 1 |x =1 ) . . . . . . . . . ϕ 1 (x n |x =n ) • • • ϕ n (x n |x =n )    ~involving continuous η with derivative 0 almost everywhere, and not differentiable on a dense set of points. where x =i ≡ (x 1 , ..., x i-1 , x i+1 , ..., x n ). If ϕ i (x j |x =j ) is symmetric in x =j , which we henceforth assume , then exchanging x i ↔ x j is (still) equivalent to exchanging rows i and j in Φ(x), hence det Φ is still AS. The question arises how many GSD are needed to be able to represent every AS function ψ. The answer turns out to be 'just one', but with non-obvious smoothness relations: Any AS ψ can be represented by some Φ, any analytic ψ can be represented by an analytic Φ, for k-times continuously differentiable ψ ∈ C k , a large number of derivatives are potentially lost: Theorem 3 (Representation of all (analytic) AS ψ) For every (general/analytic/C k+n(n+1)/2 ) AS function ψ(x) there exist (general/analytic/C k ) ϕ i (x j |x =j ) symmetric in x =j such that ψ(x) = det Φ(x). Proof. sketch. The proof uses ϕ 1 (x j |x =j ) := χ(x 1:n ) with χ(x) := ψ(x)/∆(x) and ∆(x) := j<i (x i -x j ) and ϕ i (x j |x =j ) := x i-1 j for 1 < i ≤ d. For this choice, det Φ reduces to χ times the Vandermonde determinant ∆. χ is obviously symmetric, but properly extending its definition to x for which ∆(x) = 0 is subtle. For general ψ, any continuation will do. For polynomial ψ, representation ψ = χ • ∆ in terms of a symmetric polynomial χ and AS polynomial ∆ is well-known. This can be extended to analytic ψ via Taylor series expansion. For k -times differentiable ψ, the construction is more difficult and the reduction in differentiability unfortunate and perhaps surprising. Full proofs can be found in the extended version. We do not know whether the loss of 1 2 n(n + 1) derivatives is an artefact of the proof or 'real'. For instance, we have seen that linear anti-symmetrization preserves differentiability. In Section 5 we show that continuity is preserved (only for d = 1).

5. d-DIMENSIONAL (ANTI)SYMMETRY

This section generalizes the theorems from Sections 3 and 4 to d > 1: the symmetric polynomial algebraic basis and the generalized Slater determinant representation. Motivation. We now consider n ∈ N, d-dimensional particles with coordinates x i ∈ R d for particles i = 1, ..., n. For d = 3 we write x i = (x i , y i , z i ) ∈ R 3 . As before, Bosons/Fermions have symmetric/AS wave functions χ(x 1 , ..., x n ). That is, χ does not change/changes sign under the exchange of two vectors x i and x j . It is not symmetric/AS under the exchange of individual coordinates e.g. y i ↔ y j . X ≡ (x 1 , ..., x n ) is a matrix with n columns and d rows. The (representation of the) symmetry group is S d n := {S d π : π ∈ S n } with S d π (x 1 , ..., x n ) := (x π(1) , ..., x π(n) ), rather than S n•d . Functions f : R d•n → R invariant under S d n are sometimes called multisymmetric or block-symmetric, if calling them symmetric could cause confusion. Algebraic basis for multisymmetric polynomials. The elementary symmetric polynomials (3) have a generalization to d > 1 [Wey46] . We only present them for d = 3. The general case is obvious from them. They can be generated from n i=1 (1 + λx i + µy i + νz i ) =: 0≤p+q+r≤n λ p µ q ν r e pqr (X) Even for d = 3 the explicit expressions for e pqr are rather cumbersome, but straightforward to obtain. One can show that {e pqr : p + q + r ≤ n} is an algebraic basis of size m = ( n+3 3 ) -1 for all multisymmetric polynomials [Wey46] . Note that constant e 000 is not included/needed. For a given X, their values can be computed in time O(mn) by expanding (4) or in time O(m log n) by FFT, where m = O(n d ). Newton's identities also generalize: e pqr (X) are polynomials in the polarized sums p pqr (X) := n i=1 η pqr (x i ) with η pqr (x) := x p y q z r . The proofs are much more involved than for d = 1. For the general d-case we have: Theorem 4 (Multisymmetric polynomial algebraic basis [Wey46] ) Every continuous (block-=multi)symmetric function φ : R n•d → R can be represented as φ(X) = g( n i=1 η(x i )) with continuous g : R m → R and η : R d → R m defined as η p1...p d (x) = x p1 y p2 ...z p d for 1 ≤ p 1 + ... + p d ≤ n (p i ∈ {0, ..., n}), hence m = ( n+d d ) -1. The bar | is used to visually indicate this symmetry, otherwise there is no difference to using a comma.

The basis can be computed in time

O(m•d) ⊆ O(d•(n+1) d ). Note that there could be much smaller functional bases of size m = dn as per "our" composition-by-inversion argument for continuous representations in Section 3 of the extended version, which readily generalizes to d > 1, while the above minimal algebraic basis has larger size m = O(n d ) for n d > 1. It is an open question whether a continuous functional basis of size O(dn) exists, whether in polarized form (2) or not. Anti-Symmetry. For d = 1, all AS functions ψ have the same (core) zeros (called Fermion nodes), namely when x i = x j for some i = j, which form a union of linear spaces dividing R n into n! isomorphic partitions on which ψ are identical apart from sign ±1. This fact allowed representing every ψ as a product of a symmetric function φ and the universal anti-symmetric polynomial ∆, leading to representation Theorem 3. For d > 1, the Fermion nodes {X : ψ(X) = 0} form essentially arbitrary ψ-dependent unions of (non-linear) manifolds partitioning R dn into an arbitrary even number of cells of essentially arbitrary topology [Mit07] . This fact prevents a similar simple factoring (ψ = φ • ∆) and Vandermonde-like reduction in the proof of Theorem 3, and indeed prevents finite algebraic bases for AS polynomials. We can still show a similar representation result, albeit weaker and via a different construction: As in Section 4, consider Φ ij (X) := ϕ i (x j |x =j ), i.e. Φ(X) =    ϕ 1 (x 1 |x =1 ) • • • ϕ n (x 1 |x =1 ) . . . . . . . . . ϕ 1 (x n |x =n ) • • • ϕ n (x n |x =n )    where ϕ i (x j |x =j ) is symmetric in x =j . Theorem 5 (Representation of all AS ψ) For every AS function ψ(X) there exist ϕ i (x j |x =j ) symmetric in x =j such that ψ(X) = det Φ(X). The proof in the extended version is based on sorting x π(1) ≤ x π(2) ≤ ... ≤ x π(n) with suitable π, which is a discontinuous operation in X for d > 1. It is continuous though not differentiable for d = 1, hence ϕ i can be chosen continuous for continuous ψ in d = 1. For n = 2 and any d, any AS continuous/smooth/analytic/C k function ψ(x 1 , x 2 ) has an easy continuous/smooth/analytic/C k representation as a GSD. Choose ϕ 1 (x 1 |x 2 ) := 1 2 and ϕ 2 (x 1 |x 2 ) := ψ(x 1 , x 2 ). Whether this generalizes to n > 2 and d > 1 is an open problem.

6. NEURAL NETWORKS

In this section we will restrict the representation power of classical MLPs to equivariant functions, which are then used to define universal (anti)symmetric NN. Equivariance and all-but-one symmetry. We are mostly interested in (anti)symmetric functions, but for (de)composition we need equivariant functions, and directly need to consider the d-dimensional case. A function ϕ : (R d ) n → (R d ) n is called equivariant under permutations if ϕ(S d π (X)) = S d π (ϕ(X) ) for all permutations π ∈ S n . With slight abuse of notation we identify (ϕ(X) ) 1 ≡ ϕ 1 (X) ≡ ϕ 1 (x 1 , x 2 , ..., x n ) ≡ ϕ 1 (x 1 , x =1 ) with ϕ 1 (x 1 |x =1 ). It is easy to see (see extended version) that a function ϕ is equivariant (under permutations) if and only if ϕ i (X) = ϕ 1 (x i |x =i ) ∀i and ϕ 1 (x i |x =i ) is symmetric in x =i . Hence ϕ 1 suffices to describe all of ϕ. Equivariant Neural Network. We aim at approximating equivariant ϕ by an Equivariant MLP (EMLP) defined as follows: The output X ∈ R d ×n of an EMLP layer is computed from its input X ∈ R d×n by x i := τ i (X) := τ 1 (x i |x =i ) ≡ τ 1,W,V,u (x i |x =i ) := σ(Wx i + V j =i x j + u) (5) where τ : R d×n → R d ×n can be shown to be an equivariant "transfer" function with weight matrices W, V ∈ R d ×d , and u ∈ R d the biases. Using j as in [ZKR + 18] instead of j =i as in [PSMF20] works as well. A L-(hidden)-layer EMLP concatenates L such layers (with non-linearity removed from the last layer). It is easy to see that EMLP is indeed equivariant and that the argument of σ() is the only linear function in X with such invariance [ZKR + 18, Lem.3]. In previous work it may have been tacitly been assumed that the universality of the polarized representation in Theorem 2 implies that EMLP can approximate all equivariant functions. In Sections 7&8 of the extended version we provide an explicit construction and proof based on Theorem 4. The explicit steps in the proof can be retraced if one wishes to derive error bounds. Theorem 6 (Universality of (two-hidden-layer) EMLP) For any continuous non-linear activation function, EMLPs can approximate (uniformly on compacta) all and only the equivariant continuous functions. If σ is non-polynomial, a two-hidden-layer EMLP suffices. Proof. idea. The construction follows 4 steps: (1) representation of polynomials in a single vector x i , (2) multisymmetric polynomials in all-but-one vector crucially exploiting Theorem 4, (3) equivariant polynomials, (4) equivariant continuous functions. Each step (1-3) constructs an EMLP. One hidden-layer EMLP suffice by classical results for MLPs [Pin99] . Two layers can be merged, leading to a 2-hidden-layer EMLP. Step (4) uses the (Stone-)Weierstrass theorem, for which also quantitative versions exist with error bounds, e.g. based on Chebyshev polynomials. Universal Symmmetric Network. We can approximate all symmetric continuous functions by applying any symmetric continuous function ς : R d ×n → R d with the property ς(y, ..., y) = y to the output of an EMLP, e.g. ς(Y) = 1 n n i=1 y i or ς(Y) = max{y 1 , ..., y n } if d = 1. Universal AntiSymmmetric Network. By Theorem 5 we know that every AS function ψ can be represented as a GSD of n functions symmetric in all-but-one-variable. The proof of Theorem 5 requires only a single symmetric function, but using an EMLP with n equivariant function can potentially preserve smoothness as discussed in the extended paper. Let us define a (toy) FermiNet as computing a single GSD from the output of a universal EMLP. The real FermiNet developed in [PSMF20] contains a number of extra features, which improves practical performance, theoretically most notably particle pair representations. Since it is a superset of our toy definition, the following theorem also applies to the full FermiNet. We arrived at the following result (under the same conditions as in Theorem 6): Theorem 7 (Universality of the FermiNet) A FermiNet with a single GSD can approximate any continuous anti-symmetric function. For d = 1, the approximation is again uniform on compacta. For d > 1, the proof of Theorem 5 involves discontinuous ϕ i (x j |x =j ). Any discontinuous function can be approximated by continuous functions, not in ∞-norm but only weaker p-norm for 1 ≤ p < ∞. This implies the theorem also holds for d > 1 in L p norm. Whether a stronger L ∞ result holds is an important open problem, important because approximating continuous functions by discontinuous components can cause all kinds of problems.

7. DISCUSSION

We reviewed a variety of representations for (anti)symmetric functions (ψ)φ : (R d ) n → R. The most direct and natural way is as a sum over n! permutations of some other function χ. If χ ∈ C k then also (ψ)φ ∈ C k . Unfortunately this takes exponential time, or at least is NP hard, and other direct approaches such as sampling or sorting have their own problems. The most promising approach is using Equivariant MLPs, for which we provided a constructive and complete universality proof, combined with a trivial symmetrization and a non-trivial anti-symmetrization using a large number Slater determinants. We investigated to which extent a single generalized Slater determinant introduced in [PSMF20], which can be computed in time O(n 3 ), can represent all AS ψ. We have shown that for d = 1, all AS ψ ∈ C k+n(n-1)/2 can be represented as det Φ with Φ ∈ C k . Whether Φ ∈ C k suffices to represent all ψ ∈ C k is unknown for k > 0. For k = 0 it suffices. For d > 1 and n > 2, we were only able to show that AS ψ have representations using discontinuous Φ. Important problems regarding smoothness of the representation are open in the AS case. Whether continuous Φ can represent all continuous ψ is unknown for d > 1, and similar for differentiability and for other properties. Indeed, whether any computationally efficient continuous representation of all and only AS ψ is possible is unknown.



be a function (class) we wish to represent or approximate. Let β b : R n → R be basis functions for b = 1, ..., m ∈ N ∪ {∞}, and β ≡ (β 1 , ..., β m ) : R n → R m be what we call basis vector (function), and η b : R → R a basis template, sometimes called inner function [Act18] or polarized bass function. Let g ∈ G ⊆ R m → R be a composition function (class), sometimes called 'outer function' [Act18], which creates new functions from the basis functions. Let G • β = {g(β(•)) : g ∈ G} be the class of representable functions, and G • β its topological closure, i.e. the class of all approximable functions. } β is called a G-basis for F if F = G • β or F = G • β, depending on context. Interesting classes of compositions are linear G lin := {g : g(x) = a 0 + m i=1 x i ; a 0 , a i ∈ R}, algebraic G alg := {multivariate polynomials}, functional G f unc := R m → R, and C k -functional G k f unc := C k for k-times continuously differentiable functions. The extended version illustrates on some simple examples how larger composition classes G allow (drastically) smaller bases (m) to represent the same functions F. Algebraic basis for symmetric polynomials. It is well-known that the elementary symmetric polynomials e b (x) generated by n i=1

for recent results and references. For (anti)symmetric NN such investigation has only recently begun [ZKR + 18, WFE + 19, HLL + 19, SI19]. Functions on sets are necessarily invariant under permutation, since the order of set elements is irrelevant. For countable domain, [ZKR + 18] derive a general representation based on encoding domain elements as bits into the binary expansion of real numbers. They conjecture that the construction can be generalized to uncountable domains such as R d , but it would have to involve pathological everywhere discontinuous functions [WFE + 19]. Functions on sets of fixed size n are equivalent to symmetric functions in n variables. [ZKR + 18] prove a symmetric version of Kolmogorov-Arnold's superposition theorem

