

Abstract

Information Lattice Learning (ILL) is a general framework for learning decomposed representations, called rules, of a signal such as an image or a probability distribution. Each rule is a coarsened signal used to gain human-interpretable insight into what might govern the nature of the original signal. To summarize the signal, we need several disentangled rules arranged in a hierarchy, formalized by a lattice structure. ILL focuses on explainability and generalizability from "small data", and aims for rules akin to those humans distill from experience (rather than a representation optimized for a specific task like classification). This paper presents the mathematics and algorithms of ILL, then demonstrates how ILL addresses the core question "what makes X an X" or "what makes X different from Y" to create effective, rule-based explanations designed to help human learners understand. The emphasis is on the what rather than on tasks like generating X or predicting the labels X, Y. We present typical ILL applications in artistic and scientific knowledge discovery: learning music theory from scores and chemical laws from molecule data, and revealing relationships between the two domains. We include initial benchmarks and assessments that demonstrate ILL's efficacy.

1. INTRODUCTION

With rapid progress in AI, there is an increasing desire for general AI (Goertzel & Pennachin, 2007; Chollet, 2019) and explainable AI (Adadi & Berrada, 2018; Molnar, 2019), both of which call for broad, human-like cognitive capacities. One common pursuit is to move away from "black boxes" designed for specific tasks toward broad generalization through strong abstractions made from only a few examples, with neither unlimited priors nor unlimited data ("primitive priors" and "small data" instead). In this pursuit, we present a new, task-nonspecific framework, Information Lattice Learning (ILL), to learn representations akin to human-distilled rules, e.g., producing much of a standard music theory curriculum as well as new rules in a form directly interpretable by students (shown at the end).

The term information lattice was first defined by Shannon (1953), but has remained largely conceptual and unexplored. In the context of abstraction and representation learning, we independently developed representation lattices that coincide with Shannon's information lattice when restricted to his context. Instead of inventing a new name, we adopt Shannon's. However, we not only generalize the original definition (an information lattice here is a hierarchy of representations) but also bring learning into the lattice, yielding the name ILL.

ILL explains a signal (e.g., a probability distribution) by disentangled representations, called rules. A rule explains some but not all aspects of the signal; together, the collection of rules aims to capture a large part of the signal. ILL is specially designed to address the core question "what makes X an X" or "what makes X different from Y", emphasizing the what rather than generating X or predicting the labels X, Y, in order to facilitate effective, rule-based explanations that help human learners understand. A music AI that classifies concertos, or generates one that mimics the masters, does not necessarily produce human insight about what makes a concerto a concerto, or about the best rules a novice composer might employ to write one. Our focus represents a shift from much representation-learning work (Bengio et al., 2013), which aims to find the best representation for solving a specific task (e.g., classification) with little concern for explainability. Instead of optimizing a task-specific objective function (e.g., classification error), ILL balances more general objectives that favor fewer, simpler rules for interpretability and more essential rules for effectiveness, all formalized later.

One intuition behind ILL is to break the whole into simple pieces, similar to breaking a signal into a Fourier series. Yet, rather than decomposing via projection onto an orthonormal basis and synthesizing via a weighted sum, we decompose a signal in a hierarchical space called a lattice. Another intuition behind ILL is feature selection. Yet, rather than features, we use partitions to mimic human concepts, and enable structured search in a partition lattice to mimic human learning. The goal is to restore human-like, hierarchical rule abstraction-and-realization through signal decomposition-and-synthesis in a lattice (called projection-and-lifting; Figure 1: left), resulting in more than a sum of parts.
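To make projection-and-lifting concrete, here is a minimal Python sketch under our own illustrative naming (project, lift, and the toy signal are assumptions, not the paper's implementation): a partition of the signal's domain coarsens a probability distribution into a rule, and lifting spreads each rule value uniformly back over its cell, the maximum-entropy way to realize a rule.

```python
from collections import defaultdict

def project(signal, partition):
    """Coarsen a distribution: sum the probability mass within each cell."""
    rule = defaultdict(float)
    for x, p in signal.items():
        rule[partition(x)] += p
    return dict(rule)

def lift(rule, domain, partition):
    """Realize a rule: spread each cell's mass uniformly over its members."""
    cells = defaultdict(list)
    for x in domain:
        cells[partition(x)].append(x)
    return {x: rule.get(c, 0.0) / len(members)
            for c, members in cells.items() for x in members}

# Toy signal on 12 pitch classes; the partition groups pitches modulo 4.
signal = {0: 0.5, 1: 0.25, 4: 0.125, 5: 0.125}
partition = lambda x: x % 4
rule = project(signal, partition)            # {0: 0.625, 1: 0.375}
lifted = lift(rule, range(12), partition)    # mass still sums to 1
```

Projecting then lifting generally loses information, since coarsening is many-to-one; the gap between the lifted signal and the original is what further rules in the lattice are meant to explain.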
ILL comprises two phases: (a) lattice construction, and (b) learning (i.e., searching) in the lattice. This parallels many machine learning (ML) models that comprise (a) specifying a function class and then (b) learning in that class, e.g., constructing a neural network and then learning in it by finding optimal parameters via back-propagation. ILL's construction phase is prior-efficient: it builds in universal priors that resemble human innate cognition (cf. the Core Knowledge priors (Spelke & Kinzler, 2007)), then grows a lattice of abstractions. The priors can, however, be customized to cater to a particular human learner or to facilitate more exotic knowledge discovery. ILL's learning phase is data-efficient: it learns from "small data" encoded by a signal, but searches for rich explanations of the signal via rule learning, wherein abstraction is key to "making small data large". Notably, the construction phase is prior-driven, not data-driven: data comes in only at the learning phase. Hence, the same construction may be reused in different learning phases, for different data sets, or even for data on different topics (Figure 1: right). Featuring these two phases, ILL is a hybrid model that threads the needle between a fully data-driven model and a fully prior-driven model, echoing the notion of "starting like a baby; learning like a child" (Hutson, 2018).
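The construction phase can be sketched in the same spirit: generate partitions of the domain from primitive feature maps standing in for the universal priors, and order them by refinement to obtain the lattice's partial order. The feature maps and function names below (partition_of, refines) are our illustrative assumptions, not the paper's code.

```python
from itertools import combinations

def partition_of(domain, feature):
    """Group domain elements into cells that share a feature value."""
    cells = {}
    for x in domain:
        cells.setdefault(feature(x), set()).add(x)
    return [frozenset(c) for c in cells.values()]

def refines(p, q):
    """p refines q iff every cell of p lies inside some cell of q."""
    return all(any(cp <= cq for cq in q) for cp in p)

domain = range(12)
priors = {"identity": lambda x: x,       # finest partition: singletons
          "mod3":     lambda x: x % 3,   # periodicity priors
          "mod4":     lambda x: x % 4,
          "halves":   lambda x: x // 6}  # an order/threshold prior
parts = {name: partition_of(domain, f) for name, f in priors.items()}
for a, b in combinations(parts, 2):
    if refines(parts[a], parts[b]):
        print(a, "refines", b)
    elif refines(parts[b], parts[a]):
        print(b, "refines", a)
# identity refines the rest; mod3, mod4, halves are mutually incomparable
```

Note that no data appears anywhere above: the lattice depends only on the domain and the priors, which is what lets one construction be reused across data sets and topics.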

ILL is related to many research areas. It draws ideas and approaches from lattice theory, information theory, group theory, and optimization. It shares algorithmic similarities with a range of techniques including MaxEnt, data compression, autoencoders, and compressed sensing, but with a much greater focus on achieving human-like explainability and generalizability. Below, we broadly compare ILL to prominent related models, leaving more detailed comparisons with the most similar ones to Appendix A.

Compared to deep learning, ILL is a "white-box" model balancing human-explainability and task performance.

Compared to Bayesian inference, ILL models human reasoning with widely shared, common priors and few, simple rules, rather than using probabilistic inference as the driving force.

Compared to tree-like models, ILL is structurally more general: a tree (e.g., a decision tree or hierarchical clustering) is essentially a linear lattice (formally called a chain) depicting a unidirectional refinement or coarsening process.

Compared to the concept lattice in formal concept analysis (FCA) (Ganter & Wille, 2012), ILL is conceptually more general and may include both known and unknown concepts; ILL does not require, but discovers, domain knowledge (more details in Appendix A).

We illustrate ILL applications by learning music theory from scores and chemical laws from compounds, and show how ILL's common priors facilitate mutual interpretation between the two subjects. To begin, imagine Tom and Jerry playing two 12-key pianos simultaneously, one note at a time (Figure 1: right). The frequency of the played two-note chords gives a 2D signal, plotted as a 12 × 12 grayscale heatmap. Inspecting this heatmap, what might be the underlying rules that govern their co-play? (Check: all grey pixels have a larger "Jerry-coordinate" and project to a black key along the "Tom-axis"; the sketch below verifies both rules on a toy signal.) We now elaborate on ILL and use it to distill rules for complex, realistic cases.
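As a toy illustration of that check, the following sketch (our own construction, not data from the paper) samples chords consistent with the two rules and verifies each rule by projecting the 2D signal onto a coarser feature:

```python
import numpy as np

BLACK = {1, 3, 6, 8, 10}          # black keys within a 12-key octave
rng = np.random.default_rng(0)

# Sample (tom, jerry) chords obeying both rules, then form the 2D signal.
pairs = [(t, j) for t in BLACK for j in range(12) if j > t]
counts = np.zeros((12, 12))
for t, j in rng.choice(pairs, size=500):
    counts[t, j] += 1
signal = counts / counts.sum()    # empirical 12 x 12 distribution

# Rule 1 projects onto sign(jerry - tom); rule 2 onto Tom's key color.
order = {"+": 0.0, "-": 0.0, "=": 0.0}
color = {"black": 0.0, "white": 0.0}
for t in range(12):
    for j in range(12):
        order["+" if j > t else "-" if j < t else "="] += signal[t, j]
        color["black" if t in BLACK else "white"] += signal[t, j]
print(order)   # all mass on '+': Jerry always plays above Tom
print(color)   # all mass on 'black': Tom plays only black keys
```

Each projected distribution is itself a rule in the sense above: a coarsened signal whose concentration on a single cell makes the regularity legible at a glance.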



Figure 1: ILL's main idea: decompose the signal into rules that are individually simple but collectively expressive. A lattice is first constructed regardless of the signal (prior-driven), yet the same lattice may later be used to learn rules (data-driven) from signals on different topics, e.g., music and chemistry.

