Feature Engineering for Deep Learning Models
Tom Masino and Elie Naufal

"Coming up with features is difficult, time-consuming, and requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, Machine Learning and AI via Brain Simulations

Authors: Tom Masino (tom@applieddl.com) has been doing predictive modeling in financial services for 18 years. His models have been deployed by several major trading firms and hedge funds. He is currently a principal at Applied Deep Learning, LLC. Tom has a Ph.D. in Neuroscience from the University of Chicago and lives in New York City. Elie Naufal (elie@applieddl.com) has been running investment strategies in hedge funds for 20 years. He has managed multi-strategy funds with over $500 million in assets. Versed in machine learning and neural networks, he currently leads modeling at Applied Deep Learning, LLC. Elie earned his M.S. in Operations Research at Columbia University and lives in New York City.

There is a central paradox in the application of Deep Neural Networks (DNNs). DNNs are effective because they discover features objectively within large and seemingly amorphous datasets; indeed, their essential function is to discover features in data that are not otherwise apparent and to correlate patterns of those features with output targets. Yet performance is greatly enhanced when the input data set contains features derived explicitly from a principled feature engineering effort. In our experience, and in that of many other DNN modeling efforts, the contribution of feature engineering is proving critical to maximizing predictive capacity. This article describes some insights we have gained into feature engineering from our end-to-end approach to predictive modeling with DNNs in the financial services sector, and we hope that others can benefit from them.

There are many stages to building an end-to-end supervised machine learning model. We are already familiar with data collection, cleaning and validating, exploratory data analysis (to understand which aspects of the data are likely to carry information), building target functions (that show broad range and clean target separation), and, of course, selecting and tuning the ML algorithm itself. In our experience building several dozen predictive analytic models with a deep learning approach, we have found that the stage with the most powerful impact on model efficacy is feature engineering. It is, of course, important to deal carefully and rigorously with every stage of model building and testing, but we find that the feature engineering stage remains under-appreciated, even with the recent advent of feature engineering automation packages. While such packages can reduce the manual effort for data types such as images, text, and signals, feature engineering for relational and human behavioral data benefits tremendously from domain expertise and human intuition.

Blending Exploratory Data Analysis and Feature Engineering

As seen in Figure 1, an end-to-end ML system has several stages. At the end of the exploratory data analysis (EDA) stage, one has a clear understanding of the working data set, including noise levels and the distributions of its elements. One has also dealt with any missing or erroneous data points and has sufficient cleaned data to run a model. Plentiful, clean, and well-understood data are the starting point for the feature engineering stage.
Unlike exploratory data analysis, where one takes a structured approach to issues such as sparseness, noisiness, distribution, and categorization of the data, feature engineering is less formulaic. It exists in the problem domain, whereas exploratory data analysis exists in the data domain. For example, building features around image data will be quite different from building features for auditory data, while the exploratory data analysis would be similar for both. Feature engineering starts with domain expertise and works top-down through the feature set, creating and assessing new metrics derived from raw and newly derived data. In contrast, EDA takes a bottom-up approach, always asking whether a specific piece of data is clean, dense, and informative. In fact, engineered features spring mostly from academic or industry approaches to describing and analyzing a phenomenon in a particular problem domain.

An example of engineered features from financial services

Our work in the financial securities domain is typical of a supervised learning model: we start with many standard simple features and build more derived features until we capture the essential elements of the security's behavior. These simple metrics are sometimes called "gauge" metrics, since they are what one initially measures (as if a gauge were attached to the system). In the case of financial securities trading, our gauge metrics collected directly from the field are timestamp, order price and size, and trade price and size. Feeding this dataset directly into a supervised learning model of any complexity is unlikely to produce a useful prediction.

After collecting the gauge metrics, we build another data tier based on derived metrics. Since we deal mostly with time-series (streaming) price data, our basic derived features include variance, distribution, randomness, stationarity, liquidity levels, trade volumes, and comparisons to a market index. Even when we take into account this second tier of derived metrics (variance, stationarity, etc.), the predictive power of the model is still quite limited. And here is where the modeler has to make a difficult decision: do I invest time in various ML architectures, trying different algorithms and tuning to squeeze marginally better predictions from the system, or do I instead build more derived input features that capture some essential real-world system dynamics? These approaches are not mutually exclusive, and nobody pursues one to the exclusion of the other, but we have found that putting at least as much effort into feature engineering leads to a more efficient, reliable, and accurate DNN architecture.

A third tier of features is then derived from tier two. These are proprietary in nature and cannot be disclosed here. But again, after each tier of feature generation, we cycle through another round of exploratory data analysis to assess the sparseness, distribution, and overall potential information content of each feature before feeding it into the DNN.
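To make the second tier concrete, here is a minimal sketch of how a few such derived features might be computed from raw gauge metrics using pandas. The column names (`trade_price`, `trade_size`), the index series, and the five-minute window are illustrative assumptions, not our production feature set.

```python
import numpy as np
import pandas as pd

def tier_two_features(trades: pd.DataFrame, index_px: pd.Series,
                      window: str = "5min") -> pd.DataFrame:
    """Derive simple second-tier features from gauge metrics.

    Assumes `trades` has a DatetimeIndex and columns 'trade_price'
    and 'trade_size'; `index_px` is a market index price series.
    """
    out = pd.DataFrame(index=trades.index)
    log_ret = np.log(trades["trade_price"]).diff()

    # Rolling variance of log returns: a crude local volatility gauge.
    out["ret_variance"] = log_ret.rolling(window).var()

    # Traded volume over the window: a simple liquidity/activity proxy.
    out["volume"] = trades["trade_size"].rolling(window).sum()

    # Return relative to the market index: strips out broad-market moves.
    index_ret = np.log(index_px).diff().reindex(trades.index, method="ffill")
    out["excess_ret"] = log_ret - index_ret

    return out
```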
Approaches to Feature Engineering

The core goal of feature engineering is to derive metrics that are more directly connected to the learning target than the original gauge metrics. The more closely a feature relates to the target, the more readily an ML algorithm can discover patterns in it and connect them to the target. Figure 2 contains a sample of techniques we have found useful; more exhaustive lists can be found in the references.

1. Numeric transforms (scaling features so they are numerically spread out evenly; sketched in code after this section):
• Min-max scaling (maps the full range onto [0, 1], i.e., expresses each value as a fraction of the range)
• Log transformation (compresses heavy-tailed data, expanding the usable dynamic range)
• Winsorization (clips extreme values)
• Z-score (variance) scaling
• Quantizing or binning (reduces dimensionality)

2. Normalization (creates a more even distribution of data):
• L2 (Euclidean) normalization: each feature vector x is rescaled to unit length, x' = x / ‖x‖₂

3. Text Feature Engineering (cleans data):
• N-grams
• Stemming
• Frequency filtering
• Phrase detection

4. Dimensionality Reduction with Principal Component Analysis (PCA; sketched in code after this section):
• When there are excessive linear relationships within a dataset, better features can be defined by reducing the dimensionality of the data using a PCA approach. It is frequently helpful for high-dimensional data with intrinsically lower dimensionality.
• We are essentially taking a high-dimensional feature space and calculating the minimal number of dimensions required to represent the data with little loss of information. The aim is to derive a lower-dimensional coordinate system through a linear combination of the original coordinates, using the direction of maximal variance of the data along each new coordinate. It is worth taking the time to understand this process. Keep in mind that it is mathematically a linear process; it is therefore best applied when your data shows some linear correlations, and hence redundancy.

5. Spatio-temporal Transformations:
• When using time series data, it is frequently more useful to convert it to the spatial domain. Examples of spatial metrics on time series data are event frequency, regularity, and power density.
• Poisson distribution
• Approximate entropy

6. Covariance:
• One of our favorite classes of derived features comes from finding new frames of reference against which to measure an event. This comes somewhat naturally in the trading markets, since we typically compare a stock's price movement to the S&P 500, or to some stock in a related industry. One can go much further with this approach, generating genuinely new and salient features by analyzing peripheral events and measuring the impact of those events on the stock price.

7. K-means clustering & unsupervised learning (sketched in code after this section):
• For more aggressive dimensionality reduction that does not depend on linear relationships among the features, we use cluster analysis. Typically, K-means is sufficient as an approach that systematically works through datasets to find separated clusters. This type of analysis can give a clear sense of the actual dimensionality of your problem space, which is normally much lower than initially perceived.
• In addition to generating a good set of features for model input, we use cluster analysis to guide the architecture of the DNN. For example, the number of layers and the geometry of the penultimate layers should bear some relationship to the intrinsic dimensionality of the problem space.

As can be seen even from this abbreviated list of approaches, feature engineering tasks fall into several groups:
• Cleaning and Normalizing: frequently considered part of the earlier data cleansing stage, but sometimes treated as feature engineering.
• Dimensionality Reduction: frequently less about reduction and more about re-orienting basis vectors to align with the directions of maximal variance, yielding metrics with better signal-to-noise ratios.
• Comparative Metrics: casting a metric in terms of its position in a distribution or its difference from a reference-frame metric.

It is recommended to execute feature engineering tasks in this order, since the earlier metrics can contribute to the later ones.
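As a concrete illustration of the numeric transforms and normalization listed above (items 1 and 2), here is a minimal NumPy sketch. The percentile limits, bin edges, and the toy input are illustrative assumptions, not recommended defaults.

```python
import numpy as np

x = np.array([1.0, 3.0, 7.0, 120.0, 2.5])   # toy feature column

# Min-max scaling: map the observed range onto [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Log transform: compress the heavy right tail (log1p also handles zeros).
log_x = np.log1p(x)

# Winsorization: clip extreme values at chosen percentiles.
lo, hi = np.percentile(x, [5, 95])
winsorized = np.clip(x, lo, hi)

# Z-score scaling: zero mean, unit variance.
z = (x - x.mean()) / x.std()

# L2 (Euclidean) normalization: x / ||x||_2.
l2_normalized = x / np.linalg.norm(x)

# Quantizing / binning: replace values with the index of their quartile bin.
bins = np.quantile(x, [0.25, 0.5, 0.75])
binned = np.digitize(x, bins)
```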
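The dimensionality-reduction items above (PCA, item 4, and K-means clustering, item 7) can likewise be sketched in a few lines of scikit-learn. The component threshold, cluster count, and random feature matrix are placeholders; in practice they would come from inspecting explained variance and cluster separation on real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 40))          # stand-in feature matrix
X_std = StandardScaler().fit_transform(X)

# PCA: re-orient the axes along directions of maximal variance and keep
# enough components to explain ~95% of it.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_std)
print("intrinsic dimensionality estimate:", X_pca.shape[1])

# K-means: the cluster label and the distances to each centroid can be
# fed to the model as additional derived features.
km = KMeans(n_clusters=8, n_init=10, random_state=0)
cluster_label = km.fit_predict(X_std)
centroid_dist = km.transform(X_std)      # distance to each of the 8 centroids

X_augmented = np.column_stack([X_pca, centroid_dist, cluster_label])
```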
Application to Financial Services

Applying this approach to the trading markets has proven remarkably effective, in large part because we have been able to apply a single, well-defined, albeit complex, model development workflow across a variety of problems and data collections. Think about that notion for a minute and ask whether one could have made such a statement in the past. We simply have not encountered anything this generally useful in our collective years of model building. For example, with a single modeling approach, we have been able to deliver new value for:
● Price prediction in near and far time frames
● Volatility prediction
● Actionable market anomaly detection
● Counterparty interest levels and price sensitivity
● News event impact prediction
● Likelihood of completion for complex over-the-counter trades
● Liquidity availability and reliability
● Market impact of trade execution

Additional Benefits of Enhanced Feature Engineering
• Features engineered manually are easier to interpret than those discovered by the Deep Neural Network (DNN).
• Local domain experts have more confidence in the output, since they know that their insights and efforts are being directly utilized.

Cautions and Downsides

Feature explosion? Yes, it is possible to have too many features in your model. In theory this is not obvious, since a typical neural network model is not particularly sensitive to the size of the input vector, and the underlying math does not point to any optimal input dimension. In practice, there is a big difference in model behavior between 100 and 10,000 input features. Our experience has been that more than a few hundred features will start to degrade the model output. Since the feature engineering approach outlined above can result in a combinatorial explosion of features, one can get into excessive-feature-count territory quickly. Let's take a practical approach to this problem. Generally, we have a core set of features that we know are key to the problem and are always included. We then build a cohort of derived features and ask whether they add marginal value. Tracking these feature cohorts gives us a good idea of what to include in the eventual model (a minimal sketch of this cohort test appears at the end of this section). Others, including Feature Labs and H2O.ai, have addressed the combinatorial explosion problem through more systematic approaches.

The other caution with this approach to feature engineering is that there is frequently no terminal point to the project: it can go on forever. This is where both discipline and domain expertise are quite valuable. A domain expert will advise on which types of methods are likely to yield pertinent results and, perhaps more importantly, advise as to when the model output is both useful and unlikely to yield much more value with additional work. We serve as domain experts for most of the models we have built, and when we do not act as our own domain experts, we make an effort to bring one onto the team. It is almost always a mistake to think that one can get through these efforts smoothly without a domain expert on board.
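Here is a minimal sketch of the cohort test just described, assuming a scikit-learn-style estimator, a pandas DataFrame of features, and a simple cross-validated score as the yardstick. All of these are illustrative assumptions, not our production evaluation harness.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def cohort_gain(X, y, core_cols, cohort_cols, cv=5):
    """Marginal value of a feature cohort: cross-validated score with
    the cohort included minus the score with the core features alone.
    `X` is assumed to be a pandas DataFrame."""
    model = GradientBoostingRegressor(random_state=0)
    base = cross_val_score(model, X[core_cols], y, cv=cv).mean()
    with_cohort = cross_val_score(model, X[core_cols + cohort_cols], y, cv=cv).mean()
    return with_cohort - base

# Usage sketch: keep a candidate cohort only if it adds marginal value.
# if cohort_gain(X, y, core, candidate_cohort) > 0.0:
#     core = core + candidate_cohort
```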
Conclusions

We are very excited about the utility of modern DNNs, and thankful for all the hard work that has gone into developing this field into something with many practical applications. However, there is still a lot of value to be extracted from the direct involvement of human domain experts, and that value enters the end-to-end process at the feature engineering stage. In our hands, every model has been improved by a concentrated effort on feature engineering, and in several cases it was the essential step to success. The good news is that the ranks of feature engineering practitioners are growing, much like the ranks of ML expertise.

References:

Articles:
1 - https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf
2 - An Introduction to Variable and Feature Selection - Isabelle Guyon and André Elisseeff - Journal of Machine Learning Research 3 (2003) 1157-1182
3 - Understanding Feature Engineering: Deep Learning Methods for Text Data - Dipanjan Sarkar @kdnuggets.com
4 - Understanding Feature Engineering (Parts 1, 2, 3 & 4) - Dipanjan Sarkar @towardsdatascience.com
5 - How to Create New Features Using Clustering - @towardsdatascience.com
6 - KMeans Clustering for Classification - @towardsdatascience.com
7 - A Comprehensive Guide to Data Exploration - Sunil Ray @analyticsvidhya.com
8 - Automatic Feature Engineering Using Deep Learning and Bayesian Inference - @analyticsvidhya.com

PowerPoints:
9 - Feature Engineering in Machine Learning - Zdenek Zabokrtsky, Institute of Formal and Applied Linguistics, Charles University in Prague
10 - Feature Engineering in Machine Learning - Chun-Liang Li, Carnegie Mellon University
11 - Feature Engineering: Getting the Most out of Data for Predictive Models - Gabriel Moreira @gspmoreira
12 - Winning Kaggle Competitions - Hendrik Jacob van Veen, Nubank Brasil

Books:
13 - Feature Extraction, Construction and Selection: A Data Mining Perspective (1998) - Huan Liu
14 - Mastering Feature Engineering: Principles and Techniques for Data Scientists (Early Release, 2016) - Alice Zheng
15 - Feature Selection for Knowledge Discovery and Data Mining (1998) - Huan Liu

Companies and Websites:
16 - https://www.featurelabs.com/
17 - https://www.h2o.ai/
18 - https://applieddl.com/