PARSED CATEGORIC ENCODINGS WITH AUTOMUNGE

Abstract

The Automunge open source Python library platform for tabular data preprocessing automates feature engineering data transformations of numerical encoding and missing data infill for received tidy data. Transformations are fit to properties of columns in a designated train set, allowing consistent and efficient application to subsequent data pipelines such as for inference, and may be applied to distinct columns in "family tree" sets with generations and branches of derivations. Included in the library of transformations are methods to extract structure from bounded categorical string sets by way of automated string parsing, in which entries in the set of unique values are compared to identify character subset overlaps. These overlaps may be encoded by appended columns of boolean overlap detection activations or by replacing string entries with identified overlap partitions. Further string parsing options, which may also be applied to unbounded categoric sets, include extraction of numeric substring partitions from entries and search functions to identify the presence of specified substring partitions. The aggregation of these methods into "family tree" sets of transformations is demonstrated for automatically extracting structure from categoric string compositions in relation to the set of entries in a column, such as may be applied to prepare categoric string set encodings for machine learning without human intervention.

1. AUTOMUNGE

Automunge is an open source Python library, available via pip install, built on top of Pandas (McKinney, 2010), Scikit-learn (Pedregosa et al., 2011), SciPy (Virtanen et al., 2020), and NumPy (van der Walt et al., 2011). It takes as input tabular data received in tidy form (Wickham, 2014), meaning one column per feature and one row per observation, and returns numerically encoded sets with infill applied to missing points, thus providing a push-button means to feed raw tabular data directly to machine learning algorithms. The numerical encodings may be minimal, such as automated normalization of numerical sets and encoding of categorical sets, or may include more elaborate feature engineering transformations applied to distinct columns. Generally speaking, the transformations are performed based on a "fit" to properties of a column in a designated train set (e.g. a set's mean, standard deviation, or categorical entries), and that same basis is then used to consistently and efficiently apply transformations to subsequent designated test sets, such as may be intended for inference or for additional training data preparation.

The library consists of two master functions, automunge(.) and postmunge(.). The automunge(.) function receives a raw train set and, if available, a consistently formatted test set, and returns a collection of encoded sets intended for training, validation, and inference. The function also returns a populated Python dictionary, which we call the postprocess dict, capturing all of the steps and parameters of the transformations. This dictionary may then be passed along with subsequent test data to the postmunge(.) function for consistent processing on the train set basis, such as may be applied sequentially to streams of data for inference. Because postmunge(.) makes use of train set properties evaluated during the corresponding automunge(.) call instead of directly evaluating properties of the test data, processing of subsequent test data is very efficient.

Included in the platform is a library of feature engineering methods, which in some cases may be aggregated into sets to be applied to distinct columns. For such sets of transformations, which may include generations and branches of derivations, the order of implementation is designated by passing transformation categories as entries to a set of "family tree" primitives described further below.
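The train-set "fit" followed by consistent application to test data can be illustrated with a minimal standard-library sketch. This is not the Automunge API: the helper names fit_column, apply_column, fit_transform, and transform are hypothetical, the z-score and ordinal encodings are stand-ins chosen for brevity, and the returned bases dictionary only loosely plays the role that the postprocess dict plays between automunge(.) and postmunge(.).

```python
from statistics import mean, pstdev

def fit_column(values):
    """Derive a transformation basis from one train-set column (hypothetical helper)."""
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, (int, float)) for v in present):
        # Numeric column: record the train mean and standard deviation.
        # A zero standard deviation is replaced by 1.0 to avoid division by zero.
        return {"kind": "numeric", "mean": mean(present), "std": pstdev(present) or 1.0}
    # Categoric column: record an ordinal code for each unique train entry.
    codes = {c: i for i, c in enumerate(sorted(set(present), key=str))}
    return {"kind": "categoric", "codes": codes}

def apply_column(values, basis):
    """Apply a previously fit basis without re-evaluating the data's own properties."""
    if basis["kind"] == "numeric":
        # Missing entries are infilled with the train mean, which is 0.0 after scaling.
        return [0.0 if v is None else (v - basis["mean"]) / basis["std"] for v in values]
    # Missing entries and entries not seen in the train set are encoded as -1.
    return [basis["codes"].get(v, -1) for v in values]

def fit_transform(train):
    """Fit a basis per train column and return the encoded train set with the bases."""
    bases = {col: fit_column(vals) for col, vals in train.items()}
    encoded = {col: apply_column(vals, bases[col]) for col, vals in train.items()}
    return encoded, bases

def transform(test, bases):
    """Encode later data using only the stored train-set bases."""
    return {col: apply_column(test[col], bases[col]) for col in bases}

# Illustrative data only; column names and values are made up for this sketch.
train = {"size": [1.0, 2.0, 3.0, None], "shape": ["circle", "square", "circle", "oval"]}
test = {"size": [2.0, 4.0], "shape": ["square", "triangle"]}

train_encoded, bases = fit_transform(train)
test_encoded = transform(test, bases)
```

Because transform only looks up stored values in bases, the test entry "triangle", which is absent from the train set, receives the reserved code rather than altering the encoding, and the test "size" column is scaled by the train mean and standard deviation rather than its own.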

