NUMERIC ENCODING OPTIONS WITH AUTOMUNGE

Abstract

Mainstream practice in machine learning with tabular data may take for granted that any feature engineering beyond scaling of numeric sets is superfluous in the context of deep neural networks. This paper offers arguments for the potential benefits of extended encodings of numeric streams in deep learning by way of a survey of options for numeric transformations available in the Automunge open source python library platform for tabular data pipelines, where transformations may be applied to distinct columns in "family tree" sets with generations and branches of derivations. Automunge transformation options include normalization, binning, noise injection, derivatives, and more. The aggregation of these methods into family tree sets of transformations is demonstrated as a means to present numeric features to machine learning in multiple configurations of varying information content, as may be applied to encode numeric sets of unknown interpretation. Experiments demonstrate the realization of a novel generalized solution to data augmentation by noise injection for tabular learning, which may materially benefit model performance in applications with underserved training data.

1. INTRODUCTION

Of the various modalities of machine learning application (e.g. images, language, audio, etc.), tabular data, aka structured data, as may comprise tables of feature set columns and collected sample rows, in my experience does not command as much attention from the research community. I speculate this may be partly attributed to the general non-uniformity across manifestations, which precludes the conventions of most other modalities for representative benchmarks and the availability of pre-trained architectures that could be adapted with fine-tuning to practical applications. That is not to say that tabular data lacks points of uniformity across data sets, for at its core the various feature sets can at a high level be grouped into just two primary types: numeric and categoric. It was the focus of a recent paper by this author (Author, 2020) to explore methods of preparing categoric sets for machine learning as are available in the Automunge open source python library platform for tabular data pipelines. This paper gives similar treatment to methods for preparing numeric feature sets for machine learning. Of course it would be an oversimplification to characterize "numeric feature sets" as a sufficient descriptor alone to represent the wide diversity that may be found between different such instances. Numeric could refer to integers, floats, or combinations thereof. The set of entries could be bounded, the potential range of entries could be bounded on the left, right, or both sides, and the distribution of values could be thin or fat tailed, single or multi-modal. The order of samples could be independent or sequential. In some cases the values could themselves be an encoded representation of a categoric feature. Beyond the potential diversity found within our numeric features, another source of diversity arises from relationships between multiple feature sets.
For example, one feature could be independent of the others, could contain full or partial redundancy with one or more other variables by correlation, or, in the case of sequential data, there could even be causal relationships between variables across time steps. The transformations discussed in this paper will primarily not take variable interdependencies into account, and will instead operate under the assumption that the training operation of a downstream learning algorithm may be better suited to the efficient interpretation of such interdependencies, as the convention for Automunge is that data transformations (and in some cases sets of transformations) are directed for application to a distinct feature set as input. In many cases the basis for these transformations will be properties derived from the received feature in a designated "train" set (as may be passed to the automunge(.) function) for subsequent application on a consistent basis to a designated "test" set (as may be passed to the postmunge(.) function).
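As a rough illustration of this train/test basis convention, the sketch below uses plain numpy rather than the Automunge API; the helper names `fit_normalization` and `apply_normalization` are hypothetical and are used only to show the pattern of deriving parameters from the train set alone and then applying them consistently to subsequent data:

```python
import numpy as np

def fit_normalization(train_column):
    """Derive basis parameters from the designated train set only."""
    std = float(np.std(train_column))
    return {"mean": float(np.mean(train_column)),
            "std": std if std > 0.0 else 1.0}

def apply_normalization(column, basis):
    """Apply the train-set basis consistently to any subsequent data."""
    return (np.asarray(column, dtype=float) - basis["mean"]) / basis["std"]

train = [1.0, 2.0, 3.0, 4.0]
test = [2.5, 10.0]  # test entries may fall outside the train range

basis = fit_normalization(train)
train_out = apply_normalization(train, basis)
test_out = apply_normalization(test, basis)
```

The key point is that `fit_normalization` is never called on the test set; the test data is transformed on the same basis as the train data, mirroring the division of labor between automunge(.) and postmunge(.).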

2. NORMALIZATIONS

A common practice for preprocessing numeric feature sets for the application of neural networks is to apply a normalization operation in which received values are centered and scaled based on properties of the data. By conversion to comparable scale between features, backpropagation may have an easier time navigating the fitness landscape rather than overweighting higher magnitude inputs (Ng, 2011). Table 1 surveys a few normalization operations as available in Automunge. Upon inspection a few points of differentiation become evident. The choice of denominator can be material to the result: while both (max - min) and standard deviation have the effect of shrinking or enlarging values to fall within a more uniform range, the (max - min) variety has a known returned range for the output that is independent of the feature set distribution properties, thus allowing us to ensure, for instance, that all of the min-max returned values are non-negative, as may be a prerequisite for some kinds of algorithms. Of course this known range of output relies on the assumption that the range of values in subsequent test sets will correspond to the train set properties that serve as a basis. To prevent such outliers from interfering with downstream applications, Automunge allows a user to pass parameters to the transformation functions, such as to activate floors or caps on the returned range. An easy to overlook outcome of the shifting and/or centering of the returned range by way of the subtraction operation in the numerator is a loss of the original zero point; for example, with z-score normalization the returned zero point is shifted to coincide with the original mean. It is the opinion of this author that such re-centering of the data may not always be a trivial trade-off. Consider the special properties of the number 0 in mathematics, including multiplicative properties at/above/below.
By shifting the original zero point we present a (small) obstacle to the training operation, which must relearn this point. Perhaps more importantly, further trade-offs include the interpretability of the returned data. Automunge thus offers a novel form of normalization, available in our library as 'retn' (standing for "retain"), which bases the scaling formula on the range of values found within the train set, with the result of scaling the data within a similar range as some of those demonstrated above while also retaining the zero point, and thus the +/- sign, of all received data.
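The two denominator choices and the floor/cap handling of test-set outliers discussed above can be sketched in plain numpy. This is an illustrative sketch, not the library's implementation, and the clipping shown is one simple reading of what a floor/cap parameter might do:

```python
import numpy as np

def minmax_scale(column, train_min, train_max, clip=True):
    """(x - min) / (max - min) using train-set statistics as the basis.

    With clip=True, test-set entries falling outside the train range
    are capped so the returned values stay within the known [0, 1] range.
    """
    out = (np.asarray(column, dtype=float) - train_min) / (train_max - train_min)
    return np.clip(out, 0.0, 1.0) if clip else out

def zscore_scale(column, train_mean, train_std):
    """(x - mean) / std: range of output depends on the distribution,
    and the returned zero point coincides with the original mean."""
    return (np.asarray(column, dtype=float) - train_mean) / train_std

train = np.array([2.0, 4.0, 6.0, 8.0])
test = np.array([1.0, 9.0])  # outside the train range on both sides

mm_test = minmax_scale(test, train.min(), train.max())  # clipped into [0, 1]
z_train = zscore_scale(train, train.mean(), train.std())
```

Note how the min-max variant guarantees a non-negative output range from train-set properties alone, while the z-score variant's output range is distribution-dependent.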

[Table: Retain Normalization ('retn')]
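As a simplified reading of the description above (an assumption for illustration, not the library's exact 'retn' formula), a retain-style scaling might divide by the train-set range without applying any shift, so that the zero point and the sign of every entry survive the transformation:

```python
import numpy as np

def retain_scale(column, train_min, train_max):
    """Scale by the train-set range with no centering shift.

    Because the numerator is not shifted, 0 maps to 0 and the +/- sign
    of every received entry is preserved, at the cost of a returned
    range that is not anchored to a fixed interval like [0, 1].
    """
    divisor = train_max - train_min
    if divisor == 0.0:
        divisor = 1.0  # degenerate single-value feature: pass through
    return np.asarray(column, dtype=float) / divisor

train = np.array([-2.0, 0.0, 3.0])
scaled = retain_scale(train, train.min(), train.max())
# the zero entry stays at zero and all signs are unchanged
```

This makes the trade-off discussed above concrete: the data lands in a comparable magnitude band to min-max scaling, but a downstream model (or a human inspecting the output) can still read off which entries were negative, zero, or positive.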

