PARSED CATEGORIC ENCODINGS WITH AUTOMUNGE

Abstract

The Automunge open source python library platform for tabular data preprocessing automates feature engineering data transformations of numerical encoding and missing data infill to received tidy data. Transformations are fit to properties of columns in a designated train set for consistent and efficient application to subsequent data pipelines, such as for inference, and may be applied to distinct columns in "family tree" sets with generations and branches of derivations. Included in the library of transformations are methods to extract structure from bounded categoric string sets by way of automated string parsing, in which comparisons between entries in the set of unique values identify character subset overlaps, which may be encoded by appended columns of boolean overlap detection activations or by replacing string entries with identified overlap partitions. Further string parsing options, which may also be applied to unbounded categoric sets, include extraction of numeric substring partitions from entries and search functions to identify the presence of specified substring partitions. The aggregation of these methods into "family tree" sets of transformations is demonstrated for use to automatically extract structure from categoric string compositions in relation to the set of entries in a column, such as may be applied to prepare categoric string set encodings for machine learning without human intervention.

1. AUTOMUNGE

Automunge is an open source python library, available now for pip install, built on top of Pandas (McKinney, 2010), Scikit-learn (Pedregosa et al., 2011), SciPy (Virtanen et al., 2020), and NumPy (van der Walt et al., 2011). It takes as input tabular data received in a tidy form (Wickham, 2014), meaning one column per feature and one row per observation, and returns numerically encoded sets with infill to missing points, thus providing a push-button means to feed raw tabular data directly to machine learning algorithms. The complexity of numerical encodings may be minimal, such as automated normalization of numerical sets and encoding of categoric sets, or may include more elaborate feature engineering transformations applied to distinct columns.

Generally speaking, transformations are performed based on a "fit" to properties of a column in a designated train set (e.g. based on a set's mean, standard deviation, or categoric entries), and that same basis is then used to consistently and efficiently apply transformations to subsequent designated test sets, such as may be intended for use in inference or for additional training data preparation.

The library consists of two master functions, automunge(.) and postmunge(.). The automunge(.) function receives a raw train set and, if available, a consistently formatted test set, and returns a collection of encoded sets intended for training, validation, and inference. The function also returns a populated python dictionary, which we call the postprocess_dict, capturing all of the steps and parameters of transformations. This dictionary may then be passed along with subsequent test data to the postmunge(.) function for consistent processing on the train set basis, such as may be applied sequentially to streams of data for inference. Because postmunge(.) makes use of train set properties evaluated during the corresponding automunge(.) call instead of directly evaluating properties of the test data, its processing of subsequent test data is very efficient.
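
As a rough sketch of this workflow (a minimal sketch only: the AutoMunge class import convention and the full tuples of returned sets have varied across library versions, so the unpacking below is illustrative rather than definitive):

    import pandas as pd
    from Automunge import *

    am = AutoMunge()

    df_train = pd.DataFrame({'shape': ['circle', 'square', 'triangle', 'square'],
                             'area':  [3.1, 4.0, 1.7, 4.2]})
    df_test  = pd.DataFrame({'shape': ['triangle', 'circle'],
                             'area':  [2.2, 2.9]})

    # automunge(.) fits encodings to train set properties; the postprocess_dict
    # capturing the steps and parameters is returned as the final entry
    *train_sets, postprocess_dict = am.automunge(df_train)

    # postmunge(.) then applies that same basis to subsequent data,
    # such as streams of data received for inference
    *test_sets, postreports_dict = am.postmunge(postprocess_dict, df_test)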

Included in the platform is a library of feature engineering methods, which in some cases may be aggregated into sets to be applied to distinct columns. For such sets of transformations, as may include generations and branches of derivations, the order of implementation is designated by passing transformation categories as entries to a set of "family tree" primitives described further below.
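
As a preview, the sketch below populates a hypothetical root category 'cstm' whose upstream primitives apply a normalization alongside a supplemental missing-value marker, making use of the transformdict and assigncat parameters. The processdict functionpointer entry shown for inheriting transformation functions is an assumption based on recent library conventions; consult the library documentation for specifics.

    import pandas as pd
    from Automunge import *

    am = AutoMunge()

    df_train = pd.DataFrame({'area': [3.1, 4.0, 1.7, 4.2]})

    # hypothetical root category 'cstm': upstream primitives without offspring
    # apply 'nmbr' z-score normalization (a replacement primitive) and 'NArw'
    # missing value markers (a supplement primitive)
    transformdict = {'cstm': {'parents':       [],
                              'siblings':      [],
                              'auntsuncles':   ['nmbr'],
                              'cousins':       ['NArw'],
                              'children':      [],
                              'niecesnephews': [],
                              'coworkers':     [],
                              'friends':       []}}

    # a newly defined category may also require a processdict entry; here we
    # assume the functionpointer option to inherit from an existing category
    processdict = {'cstm': {'functionpointer': 'nmbr'}}

    # the custom root category is then assigned to a target column
    *train_sets, postprocess_dict = \
        am.automunge(df_train,
                     assigncat={'cstm': ['area']},
                     transformdict=transformdict,
                     processdict=processdict)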

In tabular data applications, such as is the intended use for Automunge data preparations, a common tactic for practitioners is to treat categoric sets in a coarse-grained representation for presentation to a training operation. Such aggregations transform each unique categoric entry with distinct numeric encoding constructs, such as one-hot encoding, in which unique entries are each represented with their own column for boolean activations, or ordinal encoding of single column integer representations. The Automunge library offers some further variations on categoric encodings [Fig. 1]. The default ordinal encoding 'ord3', activated when the number of unique entries exceeds some heuristic threshold, sorts encoding integers by frequency of occurrence and then alphabetically, where frequency ordering in general may be considered more useful to a training operation. For sets with a number of unique entries below this threshold, the categoric defaults instead make use of '1010' binary encodings, meaning multi-column boolean activations in which representations of distinct categoric entries may be achieved with multiple simultaneous activations, which has the benefit over one-hot encoding of reduced memory bandwidth. Special cases are made for categoric sets with 2 unique entries, which are converted by 'bnry' to a single column boolean representation, and for sets with 3 unique entries, which are somewhat arbitrarily applied with 'text' one-hot encoding. Note that each transform has a default convention for infill to missing values, which may be reconfigured for distinct columns.
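
To make the dimensionality benefit concrete, here is a small plain-python illustration (not the library's output conventions): one-hot encoding returns one column per unique entry, while a binary encoding only needs a number of columns scaling with the base-2 logarithm of the count of unique entries.

    import math

    entries = ['circle', 'square', 'triangle', 'pentagon', 'hexagon']

    onehot_columns = len(entries)                         # 5 columns
    binary_columns = math.ceil(math.log2(len(entries)))   # 3 columns

    # a binary encoding maps each entry to a distinct multi-column activation
    encoding = {entry: format(i, '0{}b'.format(binary_columns))
                for i, entry in enumerate(sorted(entries))}
    # e.g. {'circle': '000', 'hexagon': '001', 'pentagon': '010', ...}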

The categoric encoding defaults in general are based on assumptions of training models in the decision tree paradigms (e.g. Random Forest, Gradient Boosting, etc.), particularly considering that a binary representation may sacrifice compatibility with an entity embedding of categorical variables (Guo & Berkhahn, 2016) as may be applied in the context of a neural network (for which we recommend seeding the representation with 'ord3' ordinal encoding by frequency). The defaults for automation are all configurable, and alternate root categories of transformations may also be specified for distinct columns to overwrite the automation defaults.

The characterization of these encodings as a coarse graining is meant to allude to the full discarding of the structure of the received representations. When we convert a set of unique values {'circle', 'square', 'triangle'} to {1, 2, 3}, we are hiding from the training operation any information that might be inferred from grammatical structure, such as the recognition of the prefix "tri" as meaning three. Consider the word "Automunge". You can probably infer from the common prefix "auto" that there might be some automation involved, or if you are a data scientist you might recognize the term "munge" as referring to data transformations.

Figure 1: Categoric encoding examples

