Lecture 1: Use cases, Ideas and Vocabulary, Model
Performance evaluation, Algorithms and Ethics
EF308: Econometrics and Forecasting
Dr. Parvati Neelakantan
1
Introduction: AI, ML and Alert Models
• Artificial Intelligence (AI) is a concept which relates to machines that endeavour to simulate
human thinking capability and, in particular, decision making behavior.
• Machine learning is an application of AI which allows machines (i.e., statistical learning models) to
interpret (i.e., learn from) patterns in historical data without being programmed explicitly to do so.
• Machine learning algorithms learn iteratively from the data – they derive pragmatic rules from
the data.
• As opposed to the traditional approach – whereby analysts devise a rule, from their
experience, which is then programmed for implementation.
2
Introduction: AI vs Machine Learning; [Artificial
Intelligence in Europe, Ernst & Young, 2019]
3
Machine learning vs traditional rule-based alert model approach
• Alert models generate a warning (‘a flag’) that a scenario of concern is likely to arise or have
arisen – e.g., an instance of fraud in banking.
• Machine learning alert models seek to systematically generate high quality alerts.
• Traditional rule-based anti-fraud approach: the analyst determines a rule from experience which, when broken, generates an alert.
• Example: if an account-level cash outflow occurs in a country other than the bank client's country of residence and exceeds, say, two times the average daily outflow, generate an alert.
• Machine learning approach, same scenario: a statistical learning algorithm devises from the data – labelled fraud events and a vector of predictors – a rule which, when broken, generates an alert.
• Example: a complex function of customer, account and transaction traits can arise which, when it takes any value in a certain interval, generates an alert (a sketch contrasting the two approaches follows below).
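To make the contrast concrete, a minimal sketch in Python; the threshold rule, the simulated data and the logistic model are all hypothetical illustrations, not the lecture's specification:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Traditional rule-based alert: an analyst-devised threshold rule.
def rule_based_alert(outflow, avg_daily_outflow, is_foreign):
    # Flag a foreign cash outflow exceeding twice the average daily outflow.
    return is_foreign and (outflow > 2 * avg_daily_outflow)

# Machine learning alert: a rule inferred from labelled historical data.
# X: customer/account/transaction traits; y: labelled fraud events (1 = fraud).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(size=1000) > 1).astype(int)

model = LogisticRegression().fit(X, y)         # learns a scoring function from data
ml_alert = model.predict_proba(X)[:, 1] > 0.5  # alert when the fraud score is high
```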
4
Machine learning vs traditional rule-based alert model approach
• Notice that the latter rule for alert generation is devised by a machine learning algorithm via
patterns in historical data and not determined by an analyst directly.
• Both approaches are automated for implementation, but the machine learning approach uses a rule which the algorithm infers from the data.
5
Machine Learning: Why now? Provenance
Three key new developments have enabled machine learning techniques to address a spate of
important problems, across industry sectors.
1. New insights in statistical modeling
2. Exponential rise in computer processing power and storage
3. Voluminous data (“datafication”).
McKinsey (2018): ‘An Executive’s Guide to AI’ gives an interactive timeline for this confluence of trends.
6
Data produced in one day: World of data
7
Can AI really improve the decisions of experts? Yes!
‘AI can predict when you will die – no idea how it works.’ New Scientist, Nov 2019.
8
Can AI really add value in a business setting?
Potential (and active) commercially viable applications across industry sectors
• Customer Focus
• customer profiling
• recommendation systems
• dynamic pricing
• product positioning
• sales forecasting
• Employees
• recruitment, promotion and cessation
9
Perspectives from Corporations
10
Ernst & Young: Finance sector is a frontrunner in AI
applications
11
Investment Banks: Importance in Banking and Finance?
‘How will Big Data and Machine Learning change the investment landscape?
We think the change will be profound. ...the market will start reacting faster and will anticipate
traditional data sources e.g. quarterly corporate earnings, low frequency macroeconomic data. This
gives an edge to quant managers and those willing to adopt and learn new datasets and methods.
Eventually ‘old’ datasets will lose most predictive value and new datasets that capture ‘Big Data’ will
increasingly become standardized.’ JP Morgan, 2017
12
How is machine learning evaluated by end users across the
financial services industry sector?
BoE 2019: Prevalence of ML in Financial Services
13
BoE 2019: Pipeline for ML applications
14
BoE 2019: Anticipated ML applications
15
Ideas and Vocabulary
What should an AI process look like in a business setting?
16
How machine learning works
Bank of England Chapter 5 Machine Learning in UK Financial Services
17
Training and Test Samples
• Training sample: a subset of the sample to train the model on.
• Test sample: a subset of the sample to test the trained model.
Divide a single data set into training and test sub-samples. The test set should be large enough to reveal statistically meaningful results, and it should be representative of the population if at all possible. The idea is that a model fit to the training sample can ideally generalize well to the test sample; the test set serves as a proxy for new data. A minimal sketch of this split follows.
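The sketch below assumes an illustrative simulated feature matrix X and binary response y; the 30% split and the stratification are illustrative choices, not prescribed by the lecture:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))                         # illustrative feature matrix
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)  # illustrative binary response

# Hold out 30% as a test set; stratifying on y keeps the test set
# representative of the class proportions in the population.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```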
18
Technical and Pictorial representations of Machine Learning
Models
General functional form
Objective
Understanding the data: eliciting, from the data, the data generating process (the function $f$) which has generated the data. In this setting we denote the input variables (features/traits/predictors/independent variables/regressors) by $X$ and the output (predicted/dependent/regressand or response) variable by $Y$. In most cases there are several input variables – observed in the cross-section or over time – hence we write $X_1, X_2, X_3$ and so on.
19
General functional form
Assume that we observe a quantitative response variable $Y$ for some predictor variables $X_1, X_2, \ldots, X_p$. We assume the existence of some relation, $f$, that can be written as
$$Y = f(X) + \epsilon$$
where $X = (X_1, X_2, \ldots, X_p)$ and $\epsilon$ is a random error term.
Understanding the data: We seek to estimate f .
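This setup is easy to simulate; a minimal sketch in Python, assuming an illustrative true $f(x) = 2 + 3x$ and normal errors (both are assumptions of the sketch, not of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 10, size=n)    # a single predictor
f = lambda x: 2 + 3 * x           # the (normally unknown) true f
eps = rng.normal(0, 2, size=n)    # random error term with mean zero
Y = f(X) + eps                    # the observed response: Y = f(X) + eps
# In practice only (X, Y) are observed; estimating f from them is the task.
```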
20
General functional form
Prediction and Inference
The error term averages to 0; thus, we predict $Y$ using
$$\hat{Y} = \hat{f}(X)$$
where $\hat{Y}$ is the prediction of $Y$ and $\hat{f}$ the resulting estimate of $f$.
$Y$ can be a binary variable indicating whether or not an alert occurs.
Our estimate of $f$, $\hat{f}$, can be provided by a range of algorithms, e.g. linear regression models, decision trees, neural nets, support vector machines etc.
21
Can we envisage the functional form $f$ pictorially? Let's do this in income–education space.
The Income data set.
Left: The red dots are the observed values of income (in tens of thousands of dollars) and years of education
for 30 individuals.
Right: The blue curve represents the true underlying relationship between income and years of education, which
is generally unknown (but is known in this case because the data were simulated). The black lines represent the
error associated with each observation. Note that some errors are positive (if an observation lies above the blue
curve) and some are negative (if an observation lies below the curve). Overall, these errors have approximately mean zero.
22
Pictorial representation of 𝑓 : A surface in 3-d space
The plot displays income as a function of years of education and seniority in the Income data set.
The blue surface represents the true underlying relationship between income and years of education
and seniority, which is known since the data are simulated. The red dots indicate the observed values
of these quantities for 30 individuals. Black vertical lines represent errors.
23
What is this error term?
$\epsilon = Y - f(X)$ is the irreducible error; i.e., even if we knew $f(X)$, we would still make errors in prediction, since at each $X = x$ there is typically a distribution of possible $Y$ values.
$\epsilon$ contains:
• unmeasurable predictor variables (e.g., quality of management),
• unmeasured (i.e., omitted/overlooked) predictor variables,
• unspecified variation in observed predictor variables,
• inappropriate functional form relating the predictors to the outcome.
24
Mean Squared Error criterion
Typically, algorithms tend to minimize MSE. If $\hat{f}$ and $X$ are assumed to be fixed, then the expected value of the squared difference between the predicted and actual value of $Y$ is
$$E(Y - \hat{Y})^2 = [f(X) - \hat{f}(X)]^2 + Var(\epsilon),$$
where $[f(X) - \hat{f}(X)]^2$ is the reducible error and $Var(\epsilon)$, the variance of the error term, is the irreducible error.
$E(Y - \hat{Y})^2$ is the mean squared error (MSE) in the sample (i.e., the training sample).
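A sketch of this decomposition on simulated data, where the true $f$ and $Var(\epsilon) = 4$ are assumptions of the simulation, not estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=1000)
f_true = 2 + 3 * X                        # assumed true f(X)
Y = f_true + rng.normal(0, 2, size=1000)  # Var(eps) = 4 is the irreducible part

f_hat = 1.5 + 3.1 * X                     # an illustrative (imperfect) estimate
mse = np.mean((Y - f_hat) ** 2)           # E(Y - Y_hat)^2 in this sample
reducible = np.mean((f_true - f_hat) ** 2)
print(mse, reducible + 4)                 # MSE ≈ reducible error + Var(eps)
```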
25
Alert model error criterion: Proportion of incorrect predictions.
Model selection: a popular point of departure is to minimize the training MSE.
Ultimately, we will see that this is a naive approach, in particular in the context of formulating predictions.
In the context of a ‘binary classifier’, e.g. alert models, the proportion of incorrect predictions is a preferable error criterion – especially this proportion in a test sample.
26
Supervised v unsupervised learning
• Supervised learning is about (i) the prediction of a response variable or (ii) the inference of the importance of a predictor on a response variable.
• Learning process of an algorithm is ‘supervised’ by the observed response variable.
• Illustrative application: an anti-fraud model learns by iteratively inferring the association between suspected fraud events and a set of features in the data.
• Indicative machine learning algorithms: decision trees, neural nets, support vector machines etc.
For the most part, this course pertains to supervised learning techniques, with a focus on binary classifier type
alert models.
27
Unsupervised learning
• Unsupervised learning involves no response variable
• Cluster analysis is used to establish relationships among the variables or observations.
• Illustrative application: one might identify distinct groups of customers, even without seeing
their spending patterns.
• Indicative machine learning algorithms: Principal Components Analysis, K-Means Clustering
etc.
• Technically, unsupervised learning is about addressing a high dimensional scatter plot challenge.
• If there are p variables, then there are p(p-1)/2 distinct scatter plots.
• As p grows, visual inspection, while feasible, becomes challenging to interpret: what are the commonalities across variables?
• The criterion for success, unlike in supervised learning, is comparatively unclear. (A clustering sketch follows this list.)
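A brief sketch of the clustering case: K-Means applied to simulated, unlabelled data (the three latent groups and all parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Unlabelled data: three latent customer groups, no response variable observed.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 3, 6)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_          # groupings discovered without any labels

p = 10
print(p * (p - 1) // 2)      # 45 distinct scatter plots when p = 10
```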
28
Supervised learning
• Supervised learning: Regression vs Classification
Broadly speaking, problems with a quantitative (qualitative) response are regression (classification)
problems but
• Quantitative response variable: A monetary loss due to an operational event such as an
instance of fraud
• Qualitative response variable: Whether a fraudulent event is said to have occurred or not.
29
Regression versus Classification problems
• In practice, it is more complex, as indicated by the examples below. Nonetheless, the language on the previous slide is useful. Specifically, the same method can cater for both regression and classification problems:
• K nearest neighbour and gradient boosting methods can be used with either quantitative or qualitative response variables.
• Logistic regressions and linear probability models have a qualitative response, but estimate, via regression, the probability of an observation falling in each category.
• While this course will principally focus on binary classification methods, note that, on the whole, the
discussions also relate to the regression setting.
30
Black box algorithms
Accountability in the context of a ‘Black Box’ algorithm
1. Impact of Model Flexibility
2. Correlation v Causality
31
Black box algorithms
Impact of Model Flexibility
Interpretability-Explainability-Predictive Accuracy
32
Model Flexibility and Interpretability
• Flexibility of an algorithm is related to the range of shapes with which it can estimate the true f (section 2.3).
• More flexible model: more candidate features, more complex candidate functional forms relating the
features to the response.
• Technically: Flexibility of a curve is summarized in the number of degrees of freedom - the number of
free parameters to be fit.
• For example, there are two coefficients to be fit in an inflexible simple linear regression model.
• Interpretability
• Meaning: relates to our capacity to interpret, in intuitive and pragmatic (i.e. accurate) terms, how a
model has produced its predictions.
• Technical measurement rule: Capacity of a human (expert/ layman) to consistently predict a ML model’s
result. (A proxy: a short decision tree can receive a higher interpretability score.)
• Weakness of this rule: A shortcoming of this approach is the arbitrary nature of the individual recruited
for the assessment.
33
Interpretability-Flexibility ‘Trade-Off’
• How can flexibility impact Interpretability?
• More flexible models tend to be more complex and, thus, more difficult to explain/ interpret.
• Examples of highly flexible and difficult-to-interpret models: neural nets, bagging, boosting, SVMs (with non-linear kernels) and nearest neighbour approaches.
34
Interpretability-Flexibility ‘Trade-Off’
A representation of the trade-off between flexibility and interpretability, using different statistical learning
methods. In general, as the flexibility of a method increases, its interpretability decreases.
35
Importance of Interpretability
• Interpretability permits confidence that the predictive capacity of a model is reliable (accurate and equitable),
and thus is fit to be deployed.
• Interpretability poses little issue in low-risk scenarios. E.g. if a model is recommending movies to watch, that can be a low-risk task (for the recipient).
• If a model decides on a propensity to recidivism, whether a person has cancer, or whether money has been laundered, these are high-risk scenarios where interpretability is critical for engendering trust.
• As data scientists and financial institutions are held accountable for model mistakes, a necessary
step will be to interpret and validate a model (and training data) before it is deployed.
• Expectation: Models will ultimately provide not only a prediction and performance metrics but also an
interpretable account of a decision or indecision (e.g. self driving cars, Alexa the personal assistant).
36
Model Explainability vs Model interpretability
• Explainability, in contrast, is about what each and every aspect of a model represents (e.g. a node in a neural network) and its importance to the model’s performance. (The term ‘transparency’ is sometimes used instead of explainability.)
• Note: Christoph Molnar, in his book ‘Interpretable Machine Learning: A Guide for Making Black Box Models Explainable’, uses the terms interpretable and explainable interchangeably.
37
The Trade-Off Between Prediction Accuracy and Model
Explainability.
Sources: US Department of Defense’s ’Explainable AI’ and IIF report Nov 2018, Explainability in Predictive Modeling.
38
Predictive Accuracy: e.g. the proportion of correct predictions in the test (training) set.
Black box algorithm: Correlation v Causality
• Machine learning can play an integral role in tests for causality but it pertains directly only to
correlation.
• In causality tests, machine learning models can estimate a ’counterfactual’ scenario.
• Machine learning requires expert business knowledge insight whether used in tests for correlation
or causality.
• Commercially, the cost in time and resources of testing for causality is usually prohibitive.
• Key reference: Varian, 2014; Big Data: New Tricks for Econometrics. Journal of Economic
Perspectives.
39
Causality and Correlation
• Machine learning is principally about correlation and prediction, typically it has little to directly
say about causality.
• Illustration of Correlation vs Causation
• If we look at the historical data using extrapolating machine learning models, for instance, we find that the number of compliance officers deployed is strongly positively associated with rule violations.
• But this doesn’t (necessarily!) mean that assigning additional compliance officers causes more
rule violations.
40
Is establishing causality unnecessary? Correlations often
suffice.
• It is not always necessary to understand why a mechanism works as it does: why is a customer loyal? why does an individual commit fraud?
• We might not know why an outcome arises, but it can still be highly valuable to know what happens next:
• Amazon & Netflix: recommendation algorithms
• Google flu trends: propagation of a virus
• Electrocardiograms for premature baby and adult fatality predictions
• Banking industry: machine learning informed alert models.
41
Is establishing causality unnecessary? Correlations often
suffice.
• Tests of causality are about understanding why a particular outcome takes place: why does a customer churn? why do sales figures increase?
• This can be useful for two main reasons:
1. To identify features for input to a machine learning model
• Machine learning builds on domain area expertise: Domain experts can suggest predictors and
machine learning can discern between these predictors and select useful functions of them for
making predictions.
• Machine learning can substitute for, in significant part, professional skills and years of
experience. It can free up time for skilled personnel to engage in more challenging problems.
42
Is establishing causality unnecessary? Correlations often
suffice.
2. Estimating the causal impact of an intervention can prove a highly pragmatic and commercially relevant data-related question.
For instance,
• Can the recruitment of compliance officers reduce rule violations?
• Can an increased budget for advertising increase sales?
• The case for an end to theory (i.e. subject experience and domain area knowledge) is unpersuasive.
43
Performance evaluation of Alert Models
A Manager’s Guide to Alert Model Performance Evaluation
• This module focuses on binary classifiers: i.e. Alert Models.
• Our initial measures of accuracy are the training and test classification error rates.
• In the first instance, let's think of a good classifier as one which has a small test classification error rate.
• A bias-variance trade-off concept proves critical to selecting model flexibility in this setting.
• Finally, we turn to commercially sensitive performance measures: e.g., true positive and false positive rates as calculated from a confusion matrix (see the sketch below).
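A minimal sketch of these rates computed from a confusion matrix (the label vectors are illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])  # illustrative actual labels
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])  # illustrative model alerts

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)   # true positive rate: frauds correctly flagged
fpr = fp / (fp + tn)   # false positive rate: legitimate cases wrongly flagged
print(tpr, fpr)        # 0.75 and ~0.167 for these illustrative vectors
```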
44
Assessing Model Accuracy: Error Rates
Test (Training) classification error rate: the proportion of incorrect
predictions in the test (training) set
45
Assessing Model Accuracy: Test Error Rates
• To quantify the accuracy of the estimate $\hat{f}$, we use the training classification error rate
$$\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$$
• $\hat{y}_i$ is the predicted class label (using $\hat{f}$) for the $i$th observation in the training sample
• $y_i$ is the actual label observed in the data
• $I(y_i \neq \hat{y}_i)$ is an indicator variable that equals 1 if $y_i \neq \hat{y}_i$ and 0 if $y_i = \hat{y}_i$
• The test classification error rate associated with a set of test observations of the form $(x_0, y_0)$ is given by $Ave(I(y_0 \neq \hat{y}_0))$; a sketch of the computation follows below.
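A sketch of the computation (the label vectors are illustrative):

```python
import numpy as np

y_hat = np.array([1, 0, 1, 1, 0])   # predicted class labels
y     = np.array([1, 0, 0, 1, 0])   # observed class labels

# Average of the indicator I(y_i != y_hat_i): the classification error rate.
error_rate = np.mean(y != y_hat)    # here 1/5 = 0.2
print(error_rate)
```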
46
Illustrative classification algorithm: K Nearest Neighbours
• K-Nearest Neighbors, KNN
The K-Nearest Neighbors classifier estimates the conditional probability for class $j$ as the fraction of points in $N_0$, the neighbourhood of a test observation $x_0$, whose response values equal $j$:
$$\Pr(Y = j \mid X = x_0) = \frac{1}{K}\sum_{i \in N_0} I(y_i = j)$$
The KNN algorithm then classifies the test observation to the most commonly occurring class among its $K$ nearest neighbours.
Next slide: Let’s consider two classes of outcome: orange and blue circles. Imagine that the axes
(next slide) correspond to features: ’feature space’ and that there is a hypothetical point ’x’ in this
space to which we wish to ascribe a class of outcome. Let’s use KNN, K=3.
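A sketch of this classification step, using scikit-learn's KNeighborsClassifier on simulated two-class data that mimic the orange/blue setup (the data and the query point are illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Six 'blue' (0) and six 'orange' (1) observations in a 2-d feature space.
X = np.vstack([rng.normal(0, 1, size=(6, 2)), rng.normal(2, 1, size=(6, 2))])
y = np.array([0] * 6 + [1] * 6)

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
x = np.array([[1.0, 1.0]])     # the hypothetical point 'x' in feature space
print(knn.predict(x))          # majority class among its 3 nearest neighbours
print(knn.predict_proba(x))    # estimated conditional class probabilities
```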
47
KNN Algorithm in action
The KNN approach, using K = 3, is illustrated in a simple situation with six blue observations and six orange
observations. Left: a test observation at which a predicted class label is desired is shown as a black cross, x. The
three closest points to the test observation are identified, and it is predicted that the test observation belongs to
the most commonly-occurring class, in this case blue. Right: The KNN decision boundary for this example is
shown in black. The blue grid indicates the region in which a test observation will be assigned to the blue class,
and the orange grid indicates the region in which it will be assigned to the orange class.
48
Performance of the K-NN algorithm: In a simulation setting
• Let's examine the performance of the KNN algorithm, i.e. the training and test errors, as we allow model flexibility ($1/K$) to vary.
• The broken line (technically, the so-called ‘Bayes decision boundary’) in feature space is analogous to the true model $f$; errors relative to this boundary represent irreducible error.
A simulated data set consisting of 100 observations in each of two groups, indicated in blue and in orange. The
purple dashed line represents the Bayes decision boundary. The orange background grid indicates the region in
which a test observation will be assigned to the orange class, and the blue background grid indicates the region
in which a test observation will be assigned to the blue class. The Bayes error rate is analogous to the irreducible
error. The Bayes classifier is an optimal classifier, attainable here only because we know the true data generating process.
49
The Classification Setting
The black curve indicates the KNN decision boundary on the data from Figure 2.13, using K = 10. The Bayes
decision boundary is shown as a purple dashed line. The KNN and Bayes decision boundaries are very similar.
50
The Classification Setting
A comparison of the KNN decision boundaries (solid black curves) obtained using K = 1 and K = 100 on the
data from Figure 2.13. With K = 1, the decision boundary is overly flexible – low bias and high variance, while
with K = 100 it is not sufficiently flexible - high bias and low variance. The Bayes decision boundary is shown
as a purple dashed line.
51
The Classification Setting: A “U” shape in the test error curve
The KNN training error rate (blue, 200 observations) and test error rate (orange, 5,000 observations) on the data
from Figure 2.13, as the level of flexibility (assessed using 1/K) increases, or equivalently as the number of
neighbors K decreases. The black dashed line indicates the Bayes error rate. The jumpiness of the curves is due
to the small size of the training data set.
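A sketch of tracing these curves: fit KNN for a range of K and record training and test error rates (the simulated data merely stand in for the data from Figure 2.13):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))
y = (X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.5, size=600) > 1).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for K in (1, 5, 10, 50, 100):
    knn = KNeighborsClassifier(n_neighbors=K).fit(X_tr, y_tr)
    train_err = np.mean(knn.predict(X_tr) != y_tr)
    test_err = np.mean(knn.predict(X_te) != y_te)
    print(K, round(train_err, 3), round(test_err, 3))
# Training error rises with K (falls with flexibility 1/K); test error is
# typically U-shaped, minimised at an intermediate level of flexibility.
```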
52