CSCIE89 – Fraud in Credit Card Transactions | Marcos Pastor
CSCIE89 Final Project. Pastor, Marcos.
Fraud in credit card transactions.
Problem Statement.
Every card transaction submitted to the payment network for approval is potentially fraudulent.
The business need to predict fraud is very high, and the amount of data available for each
transaction, together with the billions of transactions submitted every day, makes this a very
good candidate for a Machine Learning solution. Some of the business features that help with
this prediction are the amount, card present versus remote, debit or credit, expiration date,
cardholder personal data…
Dataset:
The only public dataset that I found is a very well-known dataset in Kaggle with anonymized
credit card transactions. https://www.kaggle.com/mlgulb/creditcardfraud/downloads/creditcardfraud.zip/3
This dataset contains transactions that occurred over two days, with 492 frauds out of
284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for
0.172% of all transactions.
The main constraint is that, because the variables are anonymized, we lose all the business
knowledge and intuition that could be applied if we knew the meaning of the original variables.
Hardware:
Intel Nuc7i7bnh, Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz with 4 Cores, 32 GB DDR4
GTX 1070 Graphic Card GV-N1070IXEB-8GD eGPU
Software:
Ubuntu 18.04.2 LTS
Python 3.6.7, pip 19.1, venv with tensorflow-gpu==1.13.1 and many others
NVIDIA-SMI 418.43
Driver Version: 418.43
CUDA Version: 10.1
Strategy:
I’ll use a simple Sequential Deep Learning Network with Keras to predict the fraudulent
transactions and I’ll tune the hyperparameters to obtain the best possible solution. Because the
data is highly unbalanced, we’ll use the confusion matrix, precision and recall to measure
success.
I’ll also solve the same problem using autoencoders and compare the results; this will make the
final solution more robust to business questions. Autoencoders provide a fresh view: the
network is trained only on normal transactions, and fraud is detected as an abnormally large
distance between the original transaction and its reconstruction.
YouTube:
https://youtu.be/PSKYrhpp70g
https://youtu.be/VHPsXJePL_c
Problem understanding.
The main concern with any highly unbalanced problem is that a model can always predict the
majority class and obtain a very good but naive accuracy. In this example, if we take any
percentage of the data for testing with the same class proportions and always predict a normal
(not fraud) transaction, the accuracy would be 1 – 0.00173 = 0.9983. The primary goal is to
create a model that can predict fraud transactions; improving the accuracy is not the goal.
That’s the reason we’re adding the precision and recall measures.
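The arithmetic behind that naive baseline can be checked with a few lines of Python. This is only a sketch: the counts are the ones reported for the dataset, and the function name majority_baseline is made up for illustration.

```python
# Counts reported for the Kaggle dataset described above.
total_transactions = 284_807
fraud_transactions = 492

def majority_baseline(total, positives):
    """Metrics of a classifier that labels every transaction as normal."""
    negatives = total - positives
    accuracy = negatives / total   # every normal transaction counts as correct
    recall = 0 / positives         # no fraud is ever detected
    return accuracy, recall

accuracy, recall = majority_baseline(total_transactions, fraud_transactions)
print(f"fraud rate: {fraud_transactions / total_transactions:.5f}")
print(f"baseline accuracy: {accuracy:.4f}, fraud recall: {recall:.1f}")
```

The baseline reaches an accuracy of about 0.9983 while detecting no fraud at all, which is why accuracy alone is not a useful measure here.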
import pandas as pd
import seaborn as sns

# load the dataset and look at the first rows
cc = pd.read_csv('creditcard.csv')
cc.head()
# visualize the correlation between all variables
corr = cc.corr()
sns.heatmap(corr)
# quick view of the correlation between Class and the rest of the variables, sorted
corr.Class.sort_values()
No variable has a high correlation with Class, and the ones with some correlation don’t have any
special meaning. We can also see that Time and Amount won’t have any special impact by
themselves on predicting a fraudulent transaction.
V17      -0.326481
V14      -0.302544
V12      -0.260593
V10      -0.216883
V16      -0.196539
V3       -0.192961
V7       -0.187257
V18      -0.111485
V1       -0.101347
V9       -0.097733
V5       -0.094974
V6       -0.043643
Time     -0.012323
V24      -0.007221
V13      -0.004570
V15      -0.004223
V23      -0.002685
V22       0.000805
V25       0.003308
V26       0.004455
Amount    0.005632
V28       0.009536
V27       0.017580
V8        0.019875
V20       0.020090
V19       0.034783
V21       0.040413
V2        0.091289
V4        0.133447
V11       0.154876
Class     1.000000
Name: Class, dtype: float64
From a simple manual inspection of the data, we have 28 unknown variables that are the result
of a PCA transformation and should be useful to predict whether a transaction is fraudulent,
the amount, which we’ll normalize to train the model, and the time.
Time by itself doesn’t add business intuition for fraud decisions unless it’s tied to per-person
patterns. Having only 2 days of data and no information linked to specific users suggests that
this variable may not be useful, so we’ll drop it from the model.
Class is the dependent variable that we want to predict.
from sklearn.preprocessing import StandardScaler

# normalize the amount, then drop the original amount and time for now
cc['n_amount'] = StandardScaler().fit_transform(cc['Amount'].values.reshape(-1,1))
cc = cc.drop(['Amount'],axis=1)
cc = cc.drop(['Time'],axis=1)
cc.head()
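To make the normalization step concrete, here is a minimal pure-Python sketch of the computation StandardScaler applies to a single column: subtract the mean and divide by the population standard deviation. The amounts are hypothetical values chosen for illustration, not rows from the dataset.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population std (ddof=0), the convention StandardScaler uses
    return [(x - mu) / sigma for x in values]

# Hypothetical transaction amounts, for illustration only.
amounts = [2.50, 10.00, 49.99, 120.00, 999.00]
scaled = standardize(amounts)
print([round(z, 3) for z in scaled])
```

After scaling, the column has mean 0 and standard deviation 1, so the amount is on a scale comparable to the PCA variables.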
Model sequential solution.
My first approach for the network architecture was a naive sequential architecture with fully
connected layers. In this case, we are ignoring that we have almost no data for the fraud
transactions; in theory, it would be difficult for the network to learn how to differentiate
normal from fraud transactions.
The results were better than expected and the training was very fast overall. I tuned the
network by varying the loss function, the activation function, the optimizer and the sizes of
a similar network architecture.
To compare the different options and choose the network with the best performance, I used the
training loss and the confusion matrix applied to the test dataset. I tried to use the
validation loss, but the model was not performing correctly.
my_pred = model.predict(x_test)
my_score = model.evaluate(x_test, y_test)
confusion_compare = (my_pred > 0.5)
cm = confusion_matrix(y_test, confusion_compare)
print(cm)
In the results of the different tests that follow, we’ll see this type of confusion matrix, where
rows are the actual classes and columns are the predicted classes:
[[a1 a2]
 [b1 b2]]
a1 = true negatives, transactions predicted as normal that were actually normal
a2 = false positives, transactions predicted as fraud that were actually normal
b1 = false negatives, transactions predicted as normal that were actually fraud
b2 = true positives, transactions predicted as fraud that were actually fraud
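As a sketch of how precision and recall for the fraud class follow from such a matrix, the snippet below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes). The helper name fraud_metrics is hypothetical; the example matrix is one of those reported later in this report.

```python
def fraud_metrics(cm):
    """Precision, recall and F1 for the fraud class of a 2x2 confusion matrix.

    Assumes scikit-learn's layout: cm[i][j] counts samples whose actual
    class is i and predicted class is j (0 = normal, 1 = fraud).
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# One of the matrices reported later in this report.
cm = [[56850, 9],
      [22, 81]]
precision, recall, f1 = fraud_metrics(cm)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For this matrix the values come out as 0.90, 0.79 and 0.84, matching the classification report shown for the same run.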
The sizes that worked best for my network were 64 and 128 neurons with 2 layers, and 128, 256
and 512 neurons with 3 layers. Other sizes that I tried with 3 layers were 64-128-64 and
64-128-256.
I tested different activation functions, for example ReLU and LeakyReLU.
my_alpha = 0.05
model = Sequential()
model.add(Dense(128,input_dim=29))
model.add(LeakyReLU(alpha=my_alpha))
model.add(Dense(256))
model.add(LeakyReLU(alpha=my_alpha))
model.add(Dense(512))
model.add(LeakyReLU(alpha=my_alpha))
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics =
['accuracy'])
LeakyReLU with alpha = 0.05
Epoch 48/100
227845/227845 [==============================] - 9s 38us/step - loss: 0.0016 - acc: 0.9998
Epoch 00048: early stopping
Confusion Matrix:
[[56829    30]
 [   26    77]]
LeakyReLU with alpha = 0.01
Epoch 37/100
227845/227845 [==============================] - 9s 38us/step - loss: 0.0016 - acc: 0.9997
Epoch 00037: early stopping
Confusion Matrix:
[[56850     9]
 [   27    76]]
I also tested the model with RMSprop, Adam and Adagrad. RMSprop performed very poorly and was
discarded after the first attempt. Adam performed well, as we can see in the previous results,
but Adagrad was the one with the best results in terms of loss and confusion matrix:
my_alpha = 0.01
model = Sequential()
model.add(Dense(128, input_dim=29))
model.add(LeakyReLU(alpha=my_alpha))
model.add(Dense(256))
model.add(LeakyReLU(alpha=my_alpha))
model.add(Dense(512))
model.add(LeakyReLU(alpha=my_alpha))
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer = 'adagrad', loss = 'binary_crossentropy', metrics =
['accuracy'])
Epoch 96/100
227845/227845 [==============================] - 8s 34us/step - loss: 1.3001e-04 - acc: 1.0000
Epoch 00096: early stopping
Confusion Matrix:
[[56856     3]
 [   24    79]]
Although the training does not seem to be overfitting, we’ll also test adding a regularizer and
check whether the model can train for more epochs and learn better weights. Adding the
regularizer to the 3 Dense layers didn’t work as expected: the loss had trouble decreasing and
progress was slower than expected. I finally kept the regularizer only in the third Dense layer
(512 units) and trained again:
my_alpha = 0.01
model = Sequential()
model.add(Dense(128, input_dim=29))
model.add(LeakyReLU(alpha=my_alpha))
model.add(Dense(256))
model.add(LeakyReLU(alpha=my_alpha))
model.add(Dense(512,kernel_regularizer = regularizers.l2(0.001)))
model.add(LeakyReLU(alpha=my_alpha))
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer = 'adagrad', loss = 'binary_crossentropy', metrics =
['accuracy'])
Results after 200 epochs:
Epoch 100/100
227845/227845 [==============================] - 9s 38us/step - loss: 5.6172e-04 - acc: 0.9999
Confusion Matrix:
[[56854     5]
 [   24    79]]
Results after 400 epochs:
Epoch 200/200
227845/227845 [==============================] - 8s 35us/step - loss: 3.0213e-04 - acc: 1.0000
Confusion Matrix:
[[56854     5]
 [   24    79]]
I concluded that regularization does not add any value: we can obtain the same results faster,
with less training, if we remove it.
Next I repeated the previous training with 200 epochs, increasing the patience for early
stopping to 30 epochs and using the parameter restore_best_weights = True to see if we can
improve the results. The results were somewhat interesting:
Epoch 200/200
227845/227845 [==============================] - 8s 35us/step - loss: 1.7455e-04 - acc: 1.0000
Confusion Matrix:
[[56850     9]
 [   22    81]]
Classification Report:
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     56859
           1       0.90      0.79      0.84       103
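The effect of the patience setting and of restoring the best weights can be illustrated with a small simulation. This is a simplified sketch of patience-based early stopping, not Keras's implementation. The loss curve is hypothetical, and the function name early_stopping is chosen for illustration.

```python
def early_stopping(losses, patience, min_delta=0.0):
    """Simulate patience-based early stopping over a list of per-epoch losses.

    Returns (stopped_epoch, best_epoch), both 1-based. With "restore best
    weights" behaviour, the model kept is the one from best_epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(losses), best_epoch

# Hypothetical loss curve: a plateau followed by a late improvement.
losses = [0.90, 0.50, 0.30, 0.31, 0.32, 0.33, 0.25, 0.26]
print(early_stopping(losses, patience=3))
print(early_stopping(losses, patience=30))
```

With a short patience, training stops on the plateau and never reaches the later improvement. A longer patience lets training continue, and restoring the best weights keeps the model from the best epoch rather than the last one.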
Finally, I decided that although the given time variable makes no business sense, none of the
other variables has any meaning either. Therefore I kept the same network architecture,
normalized time and repeated the training. The results are among the best obtained, but the
differences are not significant:
Epoch 200/200
227845/227845 [==============================] - 7s 33us/step - loss: 1.5835e-04 - acc: 1.0000
Confusion Matrix:
[[56857     2]
 [   26    77]]
Classification Report:
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     56859
           1       0.97      0.75      0.85       103
   micro avg       1.00      1.00      1.00     56962
   macro avg       0.99      0.87      0.92     56962
weighted avg       1.00      1.00      1.00     56962
Autoencoder solution.
An alternative approach is to treat the fraud transactions as anomalies and look for a solution
based on autoencoders, detecting the anomalies during reconstruction.
With this approach, I train the network using only normal transactions, so the resulting model
should be able to reconstruct only normal transactions as closely as possible.
In the validation phase, the goal is to feed both normal and fraud transactions into the
network. If the model performs correctly, it should reconstruct normal transactions very well
but perform worse when it tries to reconstruct a fraud transaction. We can therefore measure
the distance between the reconstructed transaction and the original transaction and, above a
certain threshold, decide that a transaction is an anomaly and therefore fraudulent.
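The decision rule described above can be sketched in plain Python, using the mean squared error as the distance. The transactions, the reconstructions and the threshold value are hypothetical, and the function names are chosen for illustration only.

```python
def reconstruction_mse(original, reconstructed):
    """Mean squared error between one transaction and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)

def flag_anomalies(originals, reconstructions, threshold):
    """Return True for each transaction whose reconstruction error exceeds threshold."""
    return [reconstruction_mse(o, r) > threshold
            for o, r in zip(originals, reconstructions)]

# Hypothetical 3-feature transactions and autoencoder outputs, for illustration only.
originals = [[0.1, -0.2, 0.3], [2.5, -3.0, 4.0]]
reconstructions = [[0.1, -0.1, 0.2], [0.4, -0.5, 0.9]]
print(flag_anomalies(originals, reconstructions, threshold=1.0))
```

The first transaction is reconstructed closely and stays below the threshold. The second is reconstructed poorly and is flagged as an anomaly.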
I kept all the hyperparameters and decisions used for the Sequential model, so the optimizer is
Adagrad. I tested both loss = ‘binary_crossentropy’ and loss = ‘mse’. In theory, ‘mse’ should
perform better, but that was not the case, although the difference is not significant. With
‘binary_crossentropy’ the loss values come out negative; I can solve this by using ‘mse’
instead, but the performance is worse.
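One likely explanation for the negative loss values, offered here as an assumption and not as something verified on the original run, is that the standardized inputs used as reconstruction targets are not confined to [0, 1]. The sketch below evaluates the textbook binary cross-entropy formula on a hypothetical target outside that range.

```python
import math

def binary_crossentropy(target, prediction):
    """Element-wise binary cross-entropy: -(t*log(p) + (1-t)*log(1-p))."""
    return -(target * math.log(prediction)
             + (1 - target) * math.log(1 - prediction))

# Target inside [0, 1]: the loss is non-negative.
print(binary_crossentropy(1.0, 0.9))
# Hypothetical standardized feature value outside [0, 1]: the loss goes negative.
print(binary_crossentropy(2.0, 0.9))
```

With targets in [0, 1] the expression is never negative. Once a target exceeds 1, the factor (1 - t) changes sign and the loss can drop below zero.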
I first tested with activation=’relu’ and later with LeakyReLU, but I didn’t find the
improvement that I had found with the Sequential model. The implementation of LeakyReLU is
also quite different, because it has to be added as a new layer on top of the Dense layer.
I trained the network with several sizes for the encoder layers. The one that got the best
results has encoding layers of sizes 25, 20 and 15; the input layer had 30 values.
Steps to prepare the data:
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle  # assumed source of shuffle

# separate data for training and validation
normal_train = cc[cc.Class == 0]
fraud = cc[cc.Class == 1]
# we only want to train on the normal transactions
x_train_n, x_test_n = train_test_split(normal_train, test_size = 0.2,
                                       random_state=mpg_seed)
x_train = x_train_n.drop(['Class'], axis =1)
x_test = x_test_n.drop(['Class'], axis =1)
# for the reconstruction we'll combine normal and fraudulent transactions;
# the model should be able to detect the anomaly of the fraud transactions
# (pd.concat replaces DataFrame.append, which newer pandas versions removed)
xf_test = pd.concat([x_test_n, fraud])
# TODO we need to shuffle, separate the Class with 1 and 0 for reconstruction and compare
xf_test = shuffle(xf_test)
y_test = xf_test.Class
xf_test = xf_test.drop(['Class'], axis =1)
xf_test.head()
# double check that the index is the same
y_test.head()
134042    0
148874    0
147169    0
96893     0
101213    0
Name: Class, dtype: int64
Training looping over different sizes:
# assumed imports for standalone Keras; with tf.keras use the tensorflow.keras equivalents
from keras import regularizers
from keras.callbacks import EarlyStopping
from keras.layers import Dense, Input, LeakyReLU
from keras.models import Model

input_dim = 30
# possible sizes of our encoded representations
encoding_dims = [27, 25]
encoding_dims_2 = [22, 20]
encoding_dims_3 = [18, 15]
# placeholders (overwritten by the loops below)
encoding_dim = 20
encoding_dim_2 = 15
encoding_dim_3 = 10
# parameter for LeakyReLU
my_alpha = 0.01
epochs = 100
# to accumulate and compare
id_r, decod_r, xf_r, hist_r = [], [], [], []
# loop to repeat the training using different sizes
for encoding_dim in encoding_dims:
    for encoding_dim_2 in encoding_dims_2:
        for encoding_dim_3 in encoding_dims_3:
            # this is our input placeholder
            input_img = Input(shape=(input_dim,))
            # "encoded" is the encoded representation of the input
            encoded = Dense(encoding_dim)(input_img)
            encoded_2 = Dense(encoding_dim_2)(LeakyReLU(alpha=my_alpha)(encoded))
            encoded_3 = Dense(encoding_dim_3,
                              activity_regularizer=regularizers.l1(10e-5))(
                                  LeakyReLU(alpha=my_alpha)(encoded_2))
            # "decoded" is the lossy reconstruction of the input
            decoded_3 = Dense(encoding_dim_2)(LeakyReLU(alpha=my_alpha)(encoded_3))
            decoded_2 = Dense(encoding_dim)(LeakyReLU(alpha=my_alpha)(decoded_3))
            decoded = Dense(input_dim,
                            activation='sigmoid')(LeakyReLU(alpha=my_alpha)(decoded_2))
            # this model maps an input to its reconstruction
            autoencoder = Model(input_img, decoded)
            autoencoder.summary()
            autoencoder.compile(optimizer='adagrad', loss='binary_crossentropy',
                                metrics=['mae'])
            early_stopping = EarlyStopping(monitor='loss', min_delta=0, patience=10,
                                           verbose=1, mode='auto')
            hist = autoencoder.fit(x_train, x_train,
                                   epochs=epochs,
                                   batch_size=256,
                                   shuffle=True,
                                   validation_data = (x_test, x_test),
                                   callbacks=[early_stopping])
            # identifier that encodes the layer sizes and training settings
            id_str = ('auto_encoder' + str(encoding_dim) + '_' + str(encoding_dim_2)
                      + '_' + str(encoding_dim_3) + '_' + str(epochs)
                      + '_lrelu_bin_cross.h5')
            autoencoder.save_weights(id_str)
            # reconstruct the mixed (normal + fraud) test set and keep the results
            decoded_imgs = autoencoder.predict(xf_test)
            decod_r.append(decoded_imgs)
            xf_r.append(xf_test)
            id_r.append(id_str)
            hist_r.append(hist)
The difficulty now is to define the threshold above which a transaction is considered fraud
because its reconstruction is far from the original values.
Let’s reconstruct the values for the whole test dataset, which includes about 20% of the normal
transactions plus all the fraud transactions.
For every reconstruction, let’s calculate the distance to the original values and then sort the
results in descending order of distance. Instead of choosing a threshold value, we compare our
results against the 500 transactions with the largest distance.
import numpy as np

# for each trained autoencoder, count the frauds among the 500 largest reconstruction errors
for ind, dd in enumerate(decod_r):
    mse_dd = np.mean(np.power(xf_r[ind] - dd, 2), axis=1)
    mse_test_label_dd = pd.DataFrame({"mse": mse_dd, "Class": y_test})
    mse_test_label_sort = mse_test_label_dd.sort_values("mse", axis=0, ascending=False)
    top_500 = mse_test_label_sort.head(500)
    print(len(top_500[top_500["Class"] == 1]), ind, id_r[ind])
221 0 auto_encoder27_22_18_100_lrelu_bin_cross.h5
219 1 auto_encoder27_22_15_100_lrelu_bin_cross.h5
217 2 auto_encoder27_20_18_100_lrelu_bin_cross.h5
223 3 auto_encoder27_20_15_100_lrelu_bin_cross.h5
222 4 auto_encoder25_22_18_100_lrelu_bin_cross.h5
219 5 auto_encoder25_22_15_100_lrelu_bin_cross.h5
223 6 auto_encoder25_20_18_100_lrelu_bin_cross.h5
224 7 auto_encoder25_20_15_100_lrelu_bin_cross.h5
We know that we have 492 fraud transactions. Therefore, for the case with the best results:
True positives = 224
False positives = 500 – 224 = 276
False negatives = 492 – 224 = 268
Precision = 224 / 500 = 0.448
Recall = 224 / 492 = 0.455
Although this methodology could work in theory, the results show that it performs considerably
worse than our Sequential model.
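The same arithmetic can be applied to all eight encoder sizes. The sketch below uses the fraud counts printed above; the dictionary keys and the helper precision_recall_at_k are names introduced only for this illustration.

```python
TOP_K = 500          # transactions inspected per model
TOTAL_FRAUDS = 492   # fraud transactions in the test set

# Frauds found among the top 500 for each encoder size, as reported above.
hits_by_model = {
    "27_22_18": 221, "27_22_15": 219, "27_20_18": 217, "27_20_15": 223,
    "25_22_18": 222, "25_22_15": 219, "25_20_18": 223, "25_20_15": 224,
}

def precision_recall_at_k(hits, k, total_positives):
    """Precision and recall when the k highest-scoring items are flagged."""
    return hits / k, hits / total_positives

for sizes, hits in hits_by_model.items():
    precision, recall = precision_recall_at_k(hits, TOP_K, TOTAL_FRAUDS)
    print(f"{sizes}: precision={precision:.3f} recall={recall:.3f}")

# Encoder sizes with the most frauds among the top 500.
best = max(hits_by_model, key=hits_by_model.get)
```

All eight configurations land within a few hits of each other, so the choice of encoder sizes changes precision and recall by little more than one percentage point.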