Feature Engineering for Deep
Learning Models
Tom Masino and Elie Naufal
“Coming up with features is difficult, time-consuming, and requires expert
knowledge. ‘Applied machine learning’ is basically feature engineering.”
— Andrew Ng, Machine Learning and AI via Brain simulations
Authors:
Tom Masino (tom@applieddl.com) has been doing predictive modeling in financial
services for 18 years. His models have been deployed by several major trading
firms and hedge funds. He is currently a principal at Applied Deep Learning, LLC.
Tom has a Ph.D. in Neuroscience from the University of Chicago and lives in New
York City.
Elie Naufal (elie@applieddl.com) has been running investment strategies in hedge
funds for 20 years. He has managed multi-strategy funds with over $500 million in
assets. Versed in machine learning and neural networks, he currently leads
modeling at Applied Deep Learning, LLC. Elie earned his M.S. in Operations Research at
Columbia University and lives in New York City.
There’s a central paradox in the application of Deep Neural Networks (DNNs). DNNs are effective
because they discover features objectively within large and seemingly amorphous datasets. In fact,
their essential function is to discover features in the data that are not otherwise apparent and to correlate
patterns of those features with output targets. Yet performance is greatly enhanced when the input dataset
contains features derived explicitly from a principled feature engineering effort. In our experience, and in
that of many other DNN modeling efforts, the contribution of feature engineering to the modeling process
is proving critical to maximizing predictive capacity. This article describes some insights we have
gained into feature engineering from our end-to-end approach to predictive modeling using DNNs in
the financial services sector, and we hope that others can benefit from them.
There are many stages to building an end-to-end supervised machine learning model. We are
already familiar with data collection, cleaning and validating, exploratory data analysis (to understand
which aspects of the data are likely to have information content), building target functions (that show
broad range and clean target separation), and, of course, selecting and tuning the ML algorithm itself.
In our experience building several dozen predictive analytics models using a deep learning approach,
we have discovered that the stage with the most powerful impact on model efficacy is feature
engineering.
Of course, it is important to deal carefully and rigorously with all the stages of model building
and testing, but we find that the feature engineering stage continues to be under-appreciated, even
with the recent advent of feature engineering automation packages. While such packages can reduce
the human effort for data types such as images, text, and signals, feature engineering for relational and
human behavioral data benefits tremendously from domain expertise and human intuition.
Blending Exploratory Data Analysis and Feature Engineering
As seen in Figure 1, an end-to-end ML system has several stages. At the end of the exploratory
data analysis (EDA) stage, one has a clear understanding of the working data set, including noise
levels and the distributions of different elements. One has also dealt with any missing or erroneous
data points and has sufficient cleaned data to run a model. Plentiful, clean, and well-understood data
are the starting point for the feature engineering stage.
Unlike exploratory data analysis, where one takes a structured approach to issues such as
sparseness, noisiness, distribution, and categorization of the data, feature engineering is less
formulaic. It exists in the problem domain whereas exploratory data analysis exists in the data domain.
For example, building features around image data will be quite different from building features for
auditory data. The exploratory data analysis would be similar for both. Feature engineering starts with
domain expertise and works top-down through the feature set, creating and assessing new metrics
derived from raw and previously derived data. In contrast, EDA takes a bottom-up approach, always asking
whether a specific piece of data is clean, dense, and informative. In fact, engineered features spring
mostly from academic or industry approaches to describing and analyzing a phenomenon in a
particular problem domain.
An Example of Engineered Features from Financial Services
Our work in the financial securities domain is typical of a supervised learning model, where we
start with many standard simple features and build more derived features until we capture essential
elements of the behavior of the security. These simple metrics are sometimes called “gauge” metrics
since they are what one initially measures (as if you had a gauge attached to the system). In the case of
financial securities trading, our gauge metrics collected directly from the field are timestamp, order
price and size, and trade price and size. Feeding this dataset directly into a supervised learning model
of any complexity is unlikely to produce any useful prediction.
After collecting the gauge metrics, we build another data tier based on derived metrics. Since
we deal mostly with time-series (streaming) price data, our basic derived features include variance,
distribution, randomness, stationarity, liquidity levels, trade volumes, and market index comparison.
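To make this concrete, the sketch below computes a few such second-tier features from hypothetical gauge data using pandas. The column names, the simulated prices, and the five-minute windows are illustrative assumptions for the sketch, not our production schema or parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical gauge metrics: one row per trade tick (illustrative schema).
rng = np.random.default_rng(0)
ticks = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-02 09:30", periods=1_000, freq="s"),
    "trade_price": 100 + np.cumsum(rng.normal(0, 0.02, 1_000)),
    "trade_size": rng.integers(1, 500, 1_000),
    "index_price": 4_000 + np.cumsum(rng.normal(0, 0.5, 1_000)),
}).set_index("timestamp")

# Second-tier derived features: log returns, rolling variance,
# rolling traded volume, and a simple comparison to a market index.
feats = pd.DataFrame(index=ticks.index)
feats["log_ret"] = np.log(ticks["trade_price"]).diff()
feats["roll_var_5m"] = feats["log_ret"].rolling("5min").var()
feats["volume_5m"] = ticks["trade_size"].rolling("5min").sum()
index_ret = np.log(ticks["index_price"]).diff()
feats["rel_to_index_5m"] = (feats["log_ret"].rolling("5min").sum()
                            - index_ret.rolling("5min").sum())
feats = feats.dropna()
```

Each column is just another candidate feature; whether it survives depends on the exploratory checks described above.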
Even when we take into account this second tier of derived metrics (variance, stationarity, etc.),
the predictive power of the model is still quite limited. And here is where the modeler has to make a
difficult decision: do I invest time into various ML architectures, trying different algorithms and
tuning/tweaking to squeeze marginally better predictions from the system, or do I instead build more
derived input features that I think capture some essential real-world system dynamics? These are
not mutually exclusive approaches, and nobody takes all of one or the other, but we have found
that putting at least as much effort into feature engineering leads to a more efficient, reliable, and
accurate DNN architecture.
A third tier of features is then derived from tier two. These are proprietary in nature and cannot
be disclosed here. But again, after each tier of feature generation, we cycle through a set of exploratory
data analytics to assess the sparseness, distribution, and overall potential information content of each
feature prior to feeding it into the DNN.
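A minimal sketch of the kind of per-feature audit we cycle through between tiers is shown below; the specific metrics (fraction missing, spread, skew, and mutual information with the target as a rough proxy for information content) and the use of scikit-learn are illustrative choices, not a fixed recipe.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def audit_features(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Per-feature audit: sparseness, distribution, and a rough proxy
    for information content (mutual information with the target)."""
    complete = X.dropna()                      # MI estimator needs complete rows
    mi = mutual_info_regression(complete, y.loc[complete.index])
    return pd.DataFrame({
        "frac_missing": X.isna().mean(),       # sparseness
        "std": X.std(),                        # spread
        "skew": X.skew(),                      # distribution shape
        "mutual_info": pd.Series(mi, index=X.columns),
    }).sort_values("mutual_info", ascending=False)
```

In practice, features that turn out to be mostly missing, nearly constant, or unrelated to the target would be pruned before the next tier is built.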
Approaches to Feature Engineering
The core goal of feature engineering is to derive metrics that are more directly connected to the
learning target than the original gauge metrics. The more directly a feature relates to the target, the
more readily an ML algorithm can discover patterns in it and connect them to the target. Figure 2 contains a
sample of techniques we have found useful, but more exhaustive lists can be found in the references.
1. Numeric transforms (scaling features so they are numerically spread out evenly; several techniques from this list are illustrated in the code sketch that follows it):
• Min-max scaling (maps the full range onto the 0-1 range, i.e., expresses values as a percentage of the range)
• Log transformation (retracts heavy-tailed data, expands dynamic range)
• Winsorization (clips extreme values)
• Z-score (variance) scaling
• Quantizing or binning (reduces dimensionality)
2. Normalization (creates a more even distribution of data):
• L2 (Euclidean) normalization: divide each feature vector by its Euclidean norm, x′ = x / ‖x‖₂, so that every vector has unit length
3. Text Feature Engineering (cleans data):
• N-grams
• Stemming
• Frequency filtering
• Phrase detection
4. Dimensionality Reduction with Principal Component Analysis (PCA):
• When there are excessive linear relationships within a dataset, better features can be defined by
reducing the dimensionality of the data using a PCA approach. It is frequently helpful for
high-dimensional data with intrinsically lower dimensionality.
• We are essentially taking a high-dimensional feature space and calculating the minimal number of
dimensions required to represent the data without losing information. The aim is to derive a
lower-dimensional coordinate system through a linear combination of the original coordinates, using
the direction of maximal variance of the data along each new coordinate. It is worth taking the time
to understand this process. Keep in mind that it is mathematically a linear process; therefore, it is best
applied when your data shows some linear correlations, and hence redundancy.
5. Spatio-temporal Transformations:
• When using time-series data, it is frequently more useful to convert it to the spatial domain.
Examples of spatial metrics on time-series data are event frequency, regularity, and power density.
• Poisson Distribution
• Approximate Entropy
6. Covariance:
• One of our favorite types of derived features comes from defining new frames of reference against which to
measure an event. This comes somewhat naturally in the trading markets, since we typically
compare a stock's price movement to the S&P 500, or to some stock in a related industry. One
can go a lot further with this approach, generating truly new salient features by analyzing peripheral
events and measuring the impact of those events on the stock price.
7. K-means clustering & unsupervised learning:
• For more aggressive dimensionality reduction not dependent on linear relationships among the
features, we use cluster analysis. Typically, K-means is sufficient as an approach that systematically
works through datasets to find separated clusters. This type of analysis can give a clear sense of the
actual dimensionality of your problem space, which is normally much lower than initially perceived.
• In addition to generating a good set of features for model input, we use cluster analysis to guide the
architecture of the DNN. For example, the number of layers and the geometry of the penultimate
layers should bear some relationship to the intrinsic dimensionality of the problem space.
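The sketch below, referenced in item 1 above, ties a few of these techniques together on a toy feature matrix: simple numeric transforms, PCA to re-orient the basis along directions of maximal variance, and k-means cluster memberships as compact additional features. The data, parameters, and thresholds are illustrative assumptions; only the scikit-learn and NumPy calls are standard.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1_000, 8))   # toy heavy-tailed features

# 1. Numeric transforms on a single column
x = X[:, 0]
minmax = (x - x.min()) / (x.max() - x.min())               # min-max scaling to [0, 1]
logged = np.log1p(x)                                       # log transform retracts heavy tails
lo, hi = np.percentile(x, [1, 99])
winsorized = np.clip(x, lo, hi)                            # winsorization clips extremes
zscored = (x - x.mean()) / x.std()                         # z-score (variance) scaling
binned = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75])) # quantile binning

# 4. PCA: re-orient the basis along directions of maximal variance,
#    keeping enough components to explain ~95% of it.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(np.log1p(X))

# 7. K-means: cluster membership and distance to each centroid
#    as low-dimensional summary features not tied to linear structure.
km = KMeans(n_clusters=5, n_init=10, random_state=0)
cluster_id = km.fit_predict(X_pca)
cluster_dist = km.transform(X_pca)                         # distance to each centroid
```

The explained-variance ratio from the PCA step and the cluster count that best separates the data also give a rough read on the intrinsic dimensionality mentioned in item 7.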
As can be seen even from this abbreviated list of approaches, feature engineering tasks fall into
several groups:
• Cleaning and Normalizing: frequently considered part of the earlier data cleansing stage, but
sometimes considered as feature engineering.
• Dimensionality Reduction: it’s frequently less reduction and more re-orienting basis vectors to align
with maximal variance, yielding better signal-to-noise metrics.
• Comparative Metrics: casting a metric in terms of its value within a distribution or its difference from a
reference-frame metric.
We recommend executing the feature engineering tasks in this order, since the earlier metrics can
contribute to the later ones.
Application to Financial Services
Applying this approach to the trading markets has proven remarkably effective, in large part
because we were able to apply a single, well-defined, albeit complex, model development
workflow to a variety of problems and data collections. Think about that notion for a minute and ask
whether one could have made such a statement in the past. We simply have not experienced anything this
generally useful in our collective years of model building. For example, with a single modeling
approach, we have been able to deliver new value for:
• Price prediction in near and far time frames
• Volatility prediction
• Actionable market anomaly detection
• Counterparty interest levels and price sensitivity
• News event impact predictions
• Likelihood of completion for complex over-the-counter trades
• Liquidity availability and reliability
• Market impact of trade execution
Additional Benefits of Enhanced Feature Engineering
• Features engineered manually are easier to interpret than those discovered by the DNN itself.
• Domain experts have more confidence in the output, since they know their insights and efforts are
being directly utilized.
Cautions and Downsides
Feature Explosion? Yes, it is possible to have too many features in your model. In theory, this is not
obvious, since a typical neural network is not particularly sensitive to the size of the input
vector, and the underlying math does not point to any optimal input vector size. In reality, there
is a big difference in model behavior between 100 and 10,000 input features. Our experience has
been that more than a few hundred features will start to degrade the model output. Since the feature
engineering approach outlined above can result in a combinatorial explosion of features, one can get
into excessive-feature-count territory quickly.
Let’s take a practical approach to this problem. Generally, we have a core set of features that
we know are key to the problem and are always included. We then build cohorts of derived
features and ask whether they add marginal value. Tracking these feature cohorts gives us a good idea of
what to include in the eventual model. Others, including Feature Labs and H2O.ai, have addressed the
combinatorial explosion problem through more systematic approaches.
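A minimal sketch of that cohort test is shown below, assuming the feature matrix lives in a pandas DataFrame: start from the core columns, add one cohort of derived features at a time, and keep it only if the cross-validated score improves by more than a small margin. The gradient-boosting model, the margin, and the column names are placeholders for whatever cheaper proxy of the production DNN and whatever threshold a given project uses.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def screen_cohorts(X, y, core_cols, cohorts, margin=0.002):
    """Greedy cohort screen: keep a cohort of derived features only if it
    improves the cross-validated score of the current set by `margin`."""
    model = GradientBoostingRegressor(random_state=0)   # cheap proxy for the DNN
    kept = list(core_cols)
    best = cross_val_score(model, X[kept], y, cv=5).mean()
    for name, cols in cohorts.items():
        score = cross_val_score(model, X[kept + list(cols)], y, cv=5).mean()
        if score > best + margin:                       # cohort adds marginal value
            kept += list(cols)
            best = score
    return kept
```

This keeps the combinatorial growth in check without requiring a full model retrain for every candidate feature.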
The other caution with this approach to feature engineering is that there is frequently no terminal
point to the project: it can go on forever. This is where both discipline and domain expertise are quite
valuable. A domain expert will advise on what types of methods are likely to yield pertinent results
and, perhaps more importantly, advise as to when the model output is both useful and unlikely to yield
much more value with additional work. We serve as domain experts for most of the models we have
built, and when we do not act as our own domain experts, we make an effort to bring one onto the
team. It is almost always a mistake to think that one can get through these efforts smoothly without
having a domain expert on board.
Conclusions
We are very excited about the utility of modern DNNs, and thankful for all the hard work that has
gone into developing this field into something with many practical applications. However, there is still
a lot of value that can be extracted from direct involvement of human domain experts, and that value
enters the end-to-end process at the feature engineering stage. In our hands, every model has been
improved with concentrated effort on feature engineering and, in several cases, it was the essential
step to success. The good news is that the ranks of feature engineering practitioners are growing, much
like the ranks of ML experts.
References:
Articles:
1 - Andrew Ng, Machine Learning and AI via Brain simulations - https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf
2 - An Introduction to Variable and Feature Selection - Isabelle Guyon and André Elisseeff - Journal of Machine Learning Research 3 (2003) 1157-1182
3 - Understanding Feature Engineering: Deep Learning Methods for Text Data - Dipanjan Sarkar - kdnuggets.com
4 - Understanding Feature Engineering (Parts 1, 2, 3 & 4) - Dipanjan Sarkar - towardsdatascience.com
5 - How to Create New Features Using Clustering - towardsdatascience.com
6 - KMeans Clustering for Classification - towardsdatascience.com
7 - A Comprehensive Guide to Data Exploration - Sunil Ray - analyticsvidhya.com
8 - Automatic Feature Engineering Using Deep Learning and Bayesian Inference - analyticsvidhya.com
PowerPoints:
9 - Feature Engineering in Machine Learning - Zdeněk Žabokrtský - Institute of Formal and Applied Linguistics, Charles University in Prague
10 - Feature Engineering in Machine Learning - Chun-Liang Li - Carnegie Mellon University
11 - Feature Engineering: Getting the Most Out of Data for Predictive Models - Gabriel Moreira (@gspmoreira)
12 - Winning Kaggle Competitions - Hendrik Jacob van Veen - Nubank Brasil
Books:
13 - Feature Extraction, Construction and Selection: A Data Mining Perspective (1998) - Huan Liu
14 - Mastering Feature Engineering: Principles and Techniques for Data Scientists (Early Release) (2016) - Alice Zheng
15 - Feature Selection for Knowledge Discovery and Data Mining (1998) - Huan Liu
Companies and Websites:
16 - https://www.featurelabs.com/
17 - https://www.h2o.ai/
18 - https://applieddl.com/