Review of Fraud Classification Using Principal Components Analysis of RIDITS

By Louise A. Francis
Francis Analytics and Actuarial Data Mining, Inc.
Objectives
Address question: Why use new method,
PRIDIT?
Introduce other methods used in similar
circumstances
Explain how PRIDIT adds to methods
available
Explain limitations of PRIDIT/RIDIT
A Key Problem in Fraud Modeling
Most data mining methods need a target
(dependent) variable
$Y = a + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$
Fraud (Yes/No or Fraud Score) = f(predictor
variables)
Need sample of data where claims have been
determined to be fraudulent or legitimate
Dependent variable hard to get
In a large sample of automobile insurance
claims perhaps 1/3 may have an element of
abuse or fraud
Scarce resources are not expended on such
large volumes of claims to determine their
legitimacy
Only a small percentage referred to SIU
investigators or other investigations
There are time lags in determining the outcome
of investigations
Unsupervised learning
Another approach that does not require a
dependent variable
Two Key Kinds
Cluster Analysis
Principal Components/Factor Analysis
PRIDIT uses this approach
It is applied to ordered categorical variables
Cluster Analysis
Records are grouped in categories that have similar
values on the variables
Examples
Marketing: People with similar values on demographic
variables (e.g., age, gender, income) may be grouped
together for marketing
Text analysis: Use words that tend to occur together to
classify documents
Note: no dependent variable used in analysis
Clustering
Common methods: k-means, hierarchical
No dependent variable – records are grouped
into classes with similar values on the variables
Start with a measure of similarity or
dissimilarity
Maximize dissimilarity between members of
different clusters
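To make the grouping idea concrete, here is a minimal k-means sketch in plain Python. It is an illustration only: the points are made-up toy data, the starting centroids are simply the first k records, and the study itself appears to have used a TwoStep cluster procedure rather than this algorithm. k-means works by minimizing squared Euclidean distance to the cluster centroid, which is one way of keeping members of different clusters dissimilar.

```python
def kmeans(points, k, iterations=20):
    """Very small k-means: assign each point to the nearest centroid, then re-average."""
    # Assumption for this sketch: the first k points serve as starting centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy records with two continuous variables each.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

No dependent variable appears anywhere in the sketch; the labels come only from the similarity of the records.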
Dissimilarity (Distance) Measure – Continuous Variables
Euclidean Distance
$d_{ij} = \left( \sum_{k=1}^{m} (x_{ik} - x_{jk})^2 \right)^{1/2}$
i, j = records, k = variable
Manhattan Distance
$d_{ij} = \sum_{k=1}^{m} \left| x_{ik} - x_{jk} \right|$

Binary Variables
                 Column Variable
Row Variable     1       0       Total
1                a       b       a+b
0                c       d       c+d
Total            a+c     b+d
Binary Variables
Simple Matching
$d = \frac{b + c}{a + b + c + d}$
Rogers and Tanimoto
$d = \frac{2(b + c)}{(a + d) + 2(b + c)}$
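As a sketch of how these binary dissimilarities might be computed for two records, the code below counts the four cells a, b, c, d of the 2x2 table and applies both formulas. The example vectors are hypothetical, and treating the row variable as record i and the column variable as record j is an assumption of this illustration.

```python
def cell_counts(x, y):
    """Return (a, b, c, d): counts of (1,1), (1,0), (0,1) and (0,0) pairs."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def simple_matching(x, y):
    """Share of variables on which the two records disagree."""
    a, b, c, d = cell_counts(x, y)
    return (b + c) / (a + b + c + d)


def rogers_tanimoto(x, y):
    """Like simple matching, but disagreements carry double weight."""
    a, b, c, d = cell_counts(x, y)
    return 2 * (b + c) / ((a + d) + 2 * (b + c))


# Two hypothetical claims described by six yes/no (1/0) indicators.
claim_i = [1, 1, 0, 0, 1, 0]
claim_j = [1, 0, 0, 1, 1, 0]

sm = simple_matching(claim_i, claim_j)   # a=2, b=1, c=1, d=2 -> 2/6
rt = rogers_tanimoto(claim_i, claim_j)   # 4 / (4 + 4) = 0.5
print(sm, rt)
```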
Example: Fraud Data
Data from 1993 closed claim study conducted by
Automobile Insurers Bureau of Massachusetts
Claim files often have variables which may be useful
in assessing suspicion of fraud, but a dependent
variable is often not available
Variables used for clustering:
Legal representation
Prior Claim
SIU Investigation
At fault
Police report
Number of providers
Statistics for Clusters
Based on descriptive statistics, Cluster 2 appears to
have a higher likelihood of fraudulent claims – more
about this later
Percentage Yes (first five columns)
Cluster   Police Report   Medical Audit   At Fault   Legal Rep   SIU Investigation   Number Providers
1         46.7%           0.1%            42.2%      6.1%        0.0%                2
2         49.8%           5.9%            2.4%       96.0%       6.5%                4
Principal Components Analysis
A form of dimension (variable) reduction
Suppose we want to combine all the information
related to the “financial” dimension of fraud
Medical provider bill (indicative of padding claim)
Hospital bill
Number of providers
Economic Losses
Claimed wages
Incurred Losses
Principal Components
These variables are correlated but not
perfectly correlated
We replace many variables with a weighted
sum of the variables
Correlation Matrix for Variables
Correlations
                   Number      Medical   Provider   Economic              Hospital
                   Providers   Bill      Paid       Losses     Incurred   Pymt
Number Providers   1.000       0.387     0.571      0.382      0.382      0.168
Medical Bill       0.387       1.000     0.539      0.952      0.952      0.922
Provider Paid      0.571       0.539     1.000      0.531      0.531      0.327
Economic Losses    0.382       0.952     0.531      1.000      1.000      0.888
Incurred           0.382       0.952     0.531      1.000      1.000      0.888
Hospital Pymt      0.168       0.922     0.327      0.888      0.888      1.000
Finding Factor or Component
The correlation matrix is used to find the
factor that explains the most variance
(captures most of the correlation) for the set
of variables
That component or factor extracted will be a
weighted average of the variables
More than one Component or Factor may
result from applying the method
Evaluating Importance of Variables
Use factor loadings
Component Matrix
Variable            Loading
Number Providers    0.497
Medical Bill        0.974
Provider Paid       0.646
Economic Losses     0.976
Incurred            0.976
Hospital Pymt       0.886
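The link between the correlation matrix and the loadings can be sketched in plain Python. The code below takes the correlation matrix shown earlier, finds its dominant eigenvector by power iteration, and scales it by the square root of the eigenvalue to obtain loadings. Power iteration is a choice made for this sketch so that it needs no external library; it is not necessarily how the original software extracted the component.

```python
import math

names = ["Number Providers", "Medical Bill", "Provider Paid",
         "Economic Losses", "Incurred", "Hospital Pymt"]

# Correlation matrix as given on the slide.
R = [
    [1.000, 0.387, 0.571, 0.382, 0.382, 0.168],
    [0.387, 1.000, 0.539, 0.952, 0.952, 0.922],
    [0.571, 0.539, 1.000, 0.531, 0.531, 0.327],
    [0.382, 0.952, 0.531, 1.000, 1.000, 0.888],
    [0.382, 0.952, 0.531, 1.000, 1.000, 0.888],
    [0.168, 0.922, 0.327, 0.888, 0.888, 1.000],
]


def first_component(corr, iterations=500):
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    n = len(corr)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # Multiply by the matrix, then rescale to unit length.
        w = [sum(row[j] * v[j] for j in range(n)) for row in corr]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' R v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(n))
                     for i in range(n))
    return eigenvalue, v


eigenvalue, vector = first_component(R)
# Loading = eigenvector entry times the square root of the eigenvalue.
loadings = [math.sqrt(eigenvalue) * x for x in vector]

for name, loading in zip(names, loadings):
    print(f"{name:18s} {loading:.3f}")
```

Run on this matrix, the sketch gives loadings that agree with the Component Matrix to about two decimal places. The eigenvalue divided by the number of variables is the share of total variance the first component explains.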
Problem: Categorical Variables
It is not clear how to best perform Principal
Components/Factor Analysis on categorical
variables
The categories may be coded as a series of binary
dummy variables
If the categories are ordered categories, you may
lose important information
This is the problem that PRIDIT addresses
RIDIT
Variables are ordered so that lowest value is
associated with highest probability of fraud
Use Cumulative distribution of claims at each
value, i, to create RIDIT statistic for claim t,
value i
$R_{ti} = \sum_{j < i} \hat{p}_{tj} - \sum_{j > i} \hat{p}_{tj}$
Example: RIDIT for Legal Representation
Legal Representation
Value   Code   Number   Proportion   Proportion Below   Proportion Above   RIDIT
Yes     1      706      0.504        0.000              0.496              -0.496
No      2      694      0.496        0.504              0.000              0.504
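The Legal Representation example can be reproduced with a short function. This is a sketch: it assumes the categories are passed in their coded order (code 1 first) and computes, for each category, the proportion of claims below it minus the proportion above it.

```python
def ridit_scores(counts):
    """RIDIT per category: proportion below minus proportion above.

    `counts` lists the number of claims in each category, in coded order.
    """
    total = sum(counts)
    proportions = [c / total for c in counts]
    scores = []
    for i in range(len(proportions)):
        below = sum(proportions[:i])       # categories with a lower code
        above = sum(proportions[i + 1:])   # categories with a higher code
        scores.append(below - above)
    return scores


# Counts from the slide: 706 claims coded 1 (Yes), 694 coded 2 (No).
legal_rep = ridit_scores([706, 694])
print([round(s, 3) for s in legal_rep])  # [-0.496, 0.504]
```

The same function works for variables with more than two ordered categories, which is where the RIDIT transformation keeps ordering information that dummy coding would discard.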
PRIDIT
Use RIDIT statistics in Principal Components
Analysis
Component Matrix (a)
Variable         Component 1
SIU              .248
Police Report    .220
At Fault         .709
Legal Rep        .752
Medical Audit    .341
Prior Claim      .406
Extraction Method: Principal Component Analysis.
a. 1 component extracted.
Scoring
Assign a score to each claim
The score can be used to sort claims
More effort expended on claims more likely to be
fraudulent or abusive
In the case of AIB data, we can use additional
information to test how well PRIDIT did,
using the PRIDIT score
A suspicion score was assigned to each claim by
an expert
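One simplified way to picture the scoring step is a weighted sum of each claim's RIDIT values, followed by a sort. The sketch below rests on several assumptions: it uses the Component 1 loadings from the PRIDIT slide directly as weights, it skips any standardization a statistics package would apply, and every RIDIT value except the Legal Rep pair is hypothetical. The three claims are made up as well.

```python
# Loadings from the PRIDIT component matrix, used here as weights (an assumption).
weights = {
    "SIU": 0.248, "Police Report": 0.220, "At Fault": 0.709,
    "Legal Rep": 0.752, "Medical Audit": 0.341, "Prior Claim": 0.406,
}

# RIDIT value for category code 1 and code 2 of each variable.
# Only the Legal Rep pair comes from the slides; the rest are hypothetical.
ridit_by_code = {
    "SIU": {1: -0.93, 2: 0.07},
    "Police Report": {1: -0.50, 2: 0.50},
    "At Fault": {1: -0.75, 2: 0.25},
    "Legal Rep": {1: -0.496, 2: 0.504},
    "Medical Audit": {1: -0.97, 2: 0.03},
    "Prior Claim": {1: -0.80, 2: 0.20},
}

# Hypothetical claims, described by the category code on each variable.
claims = {
    "claim_A": {v: 1 for v in weights},   # lowest code on every variable
    "claim_B": {v: 2 for v in weights},   # highest code on every variable
    "claim_C": {"SIU": 2, "Police Report": 1, "At Fault": 1,
                "Legal Rep": 1, "Medical Audit": 2, "Prior Claim": 2},
}


def pridit_score(codes):
    """Weighted sum of the claim's RIDIT values across variables."""
    return sum(weights[v] * ridit_by_code[v][codes[v]] for v in weights)


scores = {name: pridit_score(codes) for name, codes in claims.items()}
# Lowest score first: with code 1 as the most suspicious category,
# the most suspicious claims come out at the top of the list.
ranked = sorted(scores, key=scores.get)
print(ranked)
```

The direction of the score depends on how the categories are coded. Here the lowest code is taken as the most suspicious, following the RIDIT slide, so the lowest scores would be reviewed first.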
PRIDIT vs. Suspicion Score
[Figure: Suspicion Score vs PRIDIT Score. Horizontal axis: Suspicion Score, 0.00 to 10.00. Vertical axis: PRIDIT Score, (1.50) to 1.00.]
Clustering and Suspicion Score
Mean Suspicion Level by TwoStep Cluster Number
Cluster   Mean Suspicion Level
1         .6445
2         3.3737
Total     1.9643
Result
There appears to be a strong relationship
between PRIDIT score and suspicion that
claim is fraudulent or abusive
The clusters resulting from the cluster
procedure also appeared to be effective in
separating legitimate from fraudulent or
abusive claims
Comparison: PRIDIT and Clustering
PRIDIT gives a score, which may be very
useful for claims sorting. Clustering assigns
claims to classes. They are either in or out of
the assigned class.
Clustering ignores information about the
order of values for categorical variables
Clustering can accommodate both categorical
and continuous variables
Comparison
Unordered categorical variables with many
values (e.g., injury type):
Clustering has a procedure for measuring
dissimilarity for these variables and can use them
in clustering
If the values for the variables contain no
meaningful order, PRIDIT will not help in
creating variables to use in Principal Components
Analysis.