Fraud Detection

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY
---🙠🕮🙢---
REPORT
PROJECT: Introduction to Business Analytics
Instructor: PhD. Nguyen Binh Minh
Group: 3
Members:
Nguyen Hai Long 20204920
Bui Thanh Tung 20204931
Nguyen Huy Hai 20200194
Duong Vu Tuan Minh 20209705
Index
1. Introduction
2. Problem description
3. About the dataset
4. Designing system
5. Classification Model
6. Conclusion
1. Introduction
Fraudulent activities in loan applications pose a significant threat to financial institutions,
necessitating the development of robust fraud detection systems. In this study, we
investigate a comprehensive set of applicant information, loan details, credit history, property
information, document verification, and social surroundings using the Kaggle Credit Card
Fraud dataset. The dataset, available at
https://www.kaggle.com/datasets/mishra5001/credit-card, provides a diverse and realistic
representation of credit-related transactions and allows us to enhance the accuracy and
effectiveness of fraud detection in loan applications.
The applicant information section includes demographic details, family-related information,
and employment details, providing insights into the applicant's background and financial
stability. Loan details encompass key factors such as loan type, credit amount, annuity, and
goods price, shedding light on the financial aspects of the loan request. Credit history,
evaluated through external scores and late payment records, offers a historical perspective
on the applicant's creditworthiness.
Property information, focusing on property type, ownership, and building features,
contributes to understanding the collateral and associated risks. Document verification flags
aid in assessing the authenticity of information provided by the applicant. The social
surroundings section captures observations of the client's social environment and defaults
over time, offering a dynamic perspective on risk.
Additionally, factors such as the duration since the last phone change and the presence of
mobile phones and emails are considered in the analysis. This multifaceted approach aims
to create a comprehensive fraud detection model capable of identifying irregularities and
potential fraudulent activities across diverse dimensions of loan applications.
By integrating various indicators and leveraging advanced analytics techniques on the
Kaggle Credit Card Fraud dataset, our study aims to contribute to the development of a
more resilient fraud detection system, ultimately safeguarding financial institutions from
fraudulent loan applications. The findings of this research have the potential to enhance risk
assessment methodologies and promote a more secure lending environment. The work can
be found at: https://www.kaggle.com/code/dngvminh/fraud-detection.
2. Problem description
In today's dynamic and interconnected business landscape, fraud has emerged as a
pervasive challenge across various industries, including banking, sales, and insurance.
While credit card fraud remains one of the most prevalent forms of illicit activities, the
spectrum of fraudulent practices has expanded to include identity theft and cyber-attacks.
The inherent complexity of fraud, coupled with the constantly evolving and sophisticated
strategies employed by fraudsters, necessitates innovative and adaptive solutions for
detection.
Addressing the multifaceted nature of fraud requires a nuanced understanding of the
underlying patterns and anomalies within datasets. Among the diverse industries grappling
with this issue, the financial sector is particularly vulnerable, as fraudulent transactions can
have severe financial repercussions and erode trust in financial systems.
In response to this pervasive challenge, this project aims to contribute to the arsenal of tools
available for fraud detection. By focusing on the domain of online transactions, specifically
utilizing the "Credit Card Fraud Detection" dataset available on Kaggle
(https://www.kaggle.com/datasets/mishra5001/credit-card), we endeavor to employ statistical
methods to discern patterns indicative of suspicious operations. This project is not merely an
academic exercise but a pragmatic endeavor aimed at empowering business analysts to
make informed decisions that safeguard their organizations against financial losses and
reputational damage.
The proactive identification of features or combinations of features that are common to
fraudulent transactions holds the key to timely detection and mitigation. Armed with such
insights, companies can strengthen their fraud prevention mechanisms, enhancing their
ability to thwart fraudulent actions before they inflict substantial harm. In essence, this
project serves as a valuable asset for business analysts, providing them with practical
experience in leveraging data analytics for fraud detection and prevention – a skill set
increasingly indispensable in today's dynamic business environment.
3. About the dataset
The "Credit Card Fraud Detection" dataset, available on Kaggle
(https://www.kaggle.com/datasets/mishra5001/credit-card), serves as a valuable resource for
exploring and understanding fraudulent activities in online credit card transactions.
The dataset contains a wealth of information related to credit card transactions,
encompassing diverse features such as applicant demographics, loan details, credit history,
property information, document verification, and social surroundings.
The variety of features enables a comprehensive analysis of different aspects related to
fraud detection.
Imbalanced Class Distribution:
One of the notable characteristics of the dataset is its highly imbalanced class distribution.
Specifically, only a small percentage (7%) of the transactions are labeled as fraudulent, while
the majority belong to the non-fraudulent class.
Imbalanced datasets present a common challenge in fraud detection, as models might
exhibit a bias toward the majority class. Special attention and techniques are required to
address this imbalance.
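The bias toward the majority class can be made concrete with a small worked example. The sketch below uses a hypothetical sample of 1,000 rows with the 7% positive share reported above; the "model" that always predicts non-fraud is an illustration only and is not one of the models trained in this report.

```python
# Hypothetical sample: 1,000 rows with 7% labeled fraudulent,
# matching the proportion reported for the dataset.
n_total = 1000
n_fraud = 70
labels = [1] * n_fraud + [0] * (n_total - n_fraud)

# A trivial "model" that always predicts the majority class (non-fraud).
predictions = [0] * n_total

# Accuracy: share of rows predicted correctly.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / n_total

# Recall: share of actual fraud cases that the model catches.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / n_fraud

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.93
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

Accuracy alone looks high here even though no fraudulent case is detected, which is why imbalanced data calls for resampling or for metrics such as recall alongside accuracy.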
Real-World Implications:
The imbalanced nature of the dataset reflects the real-world scenario where fraudulent
transactions are infrequent but have significant consequences.
Analyzing and addressing class imbalance contributes to the creation of robust fraud
detection models capable of identifying and preventing illicit activities.
4. Designing system
4.1. Distribution of Fraud and Non-Fraud Instances
In the real-world landscape of financial transactions, fraudulent activities consistently
represent a small fraction of the entire data space. This characteristic is mirrored in the
"Credit Card Fraud Detection" dataset, where instances of fraud transactions account for
approximately 7% of the entire dataset. The scarcity of fraudulent cases, while reflective of
the genuine nature of these occurrences, adds a layer of complexity to the task of building
effective fraud detection models.
The imbalanced distribution of classes, with the majority of transactions being
non-fraudulent, underscores the need for sophisticated analytical approaches. As fraud
detection algorithms are trained to discern patterns and anomalies, the inherent rarity of
fraudulent instances necessitates a heightened sensitivity to subtle signals that might
indicate potential illicit activities.
4.2. Feature selection
In the pursuit of refining the dataset for robust model training and analysis, a comprehensive
feature selection process was undertaken, guided by the overarching objective of enhancing
the quality and interpretability of the dataset. One critical facet of this process involved
addressing the presence of missing values within the dataset. Columns with a significant
proportion of missing data, exceeding a predetermined threshold of 60%, were identified and
subsequently removed. This strategic decision aimed to alleviate potential biases stemming
from incomplete information, thereby contributing to a more reliable and comprehensive
dataset for subsequent analyses.
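The missing-value filter described above can be sketched with pandas as follows. The 60% threshold comes from the text; the function name `drop_sparse_columns` and the toy column names are assumptions made for illustration and are not taken from the project notebook.

```python
import numpy as np
import pandas as pd

MISSING_THRESHOLD = 0.60  # threshold stated in the report


def drop_sparse_columns(df: pd.DataFrame,
                        threshold: float = MISSING_THRESHOLD) -> pd.DataFrame:
    """Return a copy of df without columns whose missing share exceeds threshold."""
    missing_share = df.isna().mean()  # fraction of NaN values per column
    keep = missing_share[missing_share <= threshold].index
    return df.loc[:, keep].copy()


# Toy frame; the column names are illustrative only.
toy = pd.DataFrame({
    "AMT_CREDIT": [1.0, 2.0, 3.0, 4.0, 5.0],                # 0% missing
    "MOSTLY_EMPTY": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
    "HALF_EMPTY": [np.nan, np.nan, 1.0, 2.0, 3.0],          # 40% missing
})
cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))  # ['AMT_CREDIT', 'HALF_EMPTY']
```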
In addition to mitigating missing values, a deliberate effort was made to streamline the
dataset's structure and reduce dimensionality through the consolidation of related columns.
Specifically, a group of binary indicator columns, ranging from FLAG_DOCUMENT_2 to
FLAG_DOCUMENT_21, were amalgamated into a singular variable named
FLAG_DOCUMENT. This consolidation not only served to simplify the dataset but also
encapsulated the collective information conveyed by the individual flags, providing a more
cohesive representation of the underlying features.
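One possible way to perform this consolidation is sketched below. The report does not state how the flags were combined, so summing them (giving the number of flagged documents per applicant) is an assumption, and the helper name `consolidate_document_flags` is hypothetical.

```python
import pandas as pd


def consolidate_document_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Replace FLAG_DOCUMENT_2 .. FLAG_DOCUMENT_21 with one FLAG_DOCUMENT column."""
    flag_cols = [f"FLAG_DOCUMENT_{i}" for i in range(2, 22)
                 if f"FLAG_DOCUMENT_{i}" in df.columns]
    out = df.drop(columns=flag_cols)
    # Assumption: the combined variable counts how many document flags are set.
    out["FLAG_DOCUMENT"] = df[flag_cols].sum(axis=1)
    return out


# Toy frame with only three of the twenty flags, for brevity.
toy = pd.DataFrame({
    "TARGET": [0, 1, 0],
    "FLAG_DOCUMENT_2": [0, 1, 0],
    "FLAG_DOCUMENT_3": [1, 1, 0],
    "FLAG_DOCUMENT_21": [0, 1, 0],
})
merged = consolidate_document_flags(toy)
print(merged)
```

A binary alternative (1 if any flag is set, 0 otherwise) would be obtained by replacing `sum(axis=1)` with `max(axis=1)`.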
Recognizing the significance of addressing multicollinearity to ensure model stability and
prevent redundancy, an in-depth analysis of attribute correlations was conducted. Highly
correlated attributes, indicative of redundant or overlapping information, were systematically
identified and pruned from the dataset. This meticulous curation of features aimed not only
to enhance the dataset's discriminative power but also to foster a more efficient and
streamlined input for subsequent model training and evaluation.
In summary, the feature selection process undertaken in this phase of the analysis
represents a judicious effort to refine the dataset, striking a balance between preserving
valuable information and mitigating potential sources of bias or redundancy. These strategic
decisions lay the foundation for subsequent stages of the project, where the curated dataset
will be utilized for developing and fine-tuning machine learning models for fraud detection.
4.3. Positive targets for each value of categorical attributes
To gain a granular understanding of how fraud is distributed across the data, we calculated
the share of positive targets (TARGET = 1) for each distinct value of the categorical
attributes. This approach goes beyond dataset-wide statistics, offering a nuanced view of
how the likelihood of a positive instance varies across specific subgroups. By breaking down
the positive rate at the categorical level, we gained insight into the discriminative power
of each attribute, identifying categories that contribute significantly to positive outcomes.
This is especially useful with imbalanced datasets or where certain subcategories hold
particular importance. Visualizing these rates provides a clear and interpretable picture
across diverse categories, facilitating informed decision-making and targeted improvements
in our classification task.
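A minimal sketch of this per-category calculation is given below, reading it as the share of rows with TARGET = 1 for each value of a categorical column, which is what the section title and the findings that follow suggest. The helper name `positive_rate_by_category` and the toy values are assumptions; the toy numbers are not results from the dataset.

```python
import pandas as pd


def positive_rate_by_category(df: pd.DataFrame, column: str,
                              target: str = "TARGET") -> pd.DataFrame:
    """Share of positive targets and row count for each value of a categorical column."""
    grouped = df.groupby(column)[target].agg(["mean", "count"])
    grouped = grouped.rename(columns={"mean": "positive_rate", "count": "n_rows"})
    return grouped.sort_values("positive_rate", ascending=False)


# Toy data for illustration only.
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans"] * 4 + ["Revolving loans"] * 4,
    "TARGET": [1, 0, 0, 0, 0, 0, 0, 0],
})
rates = positive_rate_by_category(toy, "NAME_CONTRACT_TYPE")
print(rates)
```

Keeping the row count next to the rate helps to spot categories whose rate rests on very few rows.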
Findings:
1. Geographic Discrepancies: Registration vs. Work/Live Regions
Customers whose registered city/region differs from the region where they work or live show
higher fraudulent tendencies.
2. Family Dynamics: Children and Loan Repayment
Applicants with more children face increased challenges in loan repayment, leading to a
higher likelihood of fraud.
3. Socioeconomic Influence: Standard of Living and Fraudulent Transactions
Residents of areas/regions with a higher standard of living are less likely to engage in
fraudulent transactions.
4. Loan Type Disparities: Cash vs. Revolving Loans
Applicants for Cash loans are more prone to fraud compared to those applying for Revolving
loans.
5. Gender Disparity: Male vs. Female Fraud Likelihood
Among loan applicants, males are more likely to commit fraud than females. Meanwhile,
businessmen and students have no trouble repaying their loans.
6. Property Ownership: Limited Impact on Fraud Incidence
Ownership of cars or properties does not significantly affect the likelihood of committing
fraud.
7. Employment Status: Unemployment and Maternity Leave Risks
Applicants who are unemployed or on maternity leave demonstrate a significantly higher rate
of committing fraud.
8. Academic Background: Influence on Fraud Rates
Lower academic qualifications correlate with higher fraud rates, with a notable decrease as
educational levels rise. Academic degree holders display a mere 1.8% chance of fraud.
9. Occupational Disparities: High vs. Low-Skilled Jobs
Low-skilled jobs, such as security staff, waiters/barmen, and laborers, are associated with
almost double the chance of fraud compared to high-skilled IT workers.
10. Industry Sectors: Fraud Likelihood Variation
Certain sectors, including transport and industry, pose a higher risk of fraudulent
transactions compared to the educational sector, particularly in university settings.
4.4. Removing attributes that are highly correlated
In order to enhance the robustness of our analytical model and mitigate the potential for
multicollinearity, a careful examination of attribute correlations was undertaken, leading to
the removal of highly correlated features. The identification and subsequent elimination of
these correlated attributes were pivotal steps in refining the dataset for our analysis.
Multicollinearity, where two or more attributes are strongly correlated, can introduce
instability and challenges in accurately assessing the individual impact of each attribute on
the outcome. By strategically removing these highly correlated attributes, we aimed to
ensure that our model remains more interpretable and generalizable, ultimately improving its
predictive performance. This process of attribute selection contributes to a more streamlined
and efficient model, reducing redundancy and enhancing the model's capacity to discern
meaningful patterns within the data. The resulting dataset, pruned of correlated attributes,
sets the foundation for a more accurate and reliable analysis of the factors influencing the
outcomes under consideration.
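The pruning step can be sketched as follows. The report does not state the correlation cutoff, so the 0.9 threshold below is an assumption, as are the function name `drop_highly_correlated` and the toy column names. For each pair of columns whose absolute Pearson correlation exceeds the threshold, the later column is dropped.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy data: x2 is a linear function of x1, x3 is only weakly related to x1.
toy = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [3.0, 5.0, 7.0, 9.0, 11.0, 13.0],
    "x3": [2.0, 9.0, 4.0, 7.0, 1.0, 5.0],
})
pruned, dropped = drop_highly_correlated(toy)
print(dropped)  # ['x2']
```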
4.5. Correlation of Continuous Variables with TARGET
The calculation of the correlation between continuous variables and the target variable is a
pivotal aspect of our analysis, shedding light on the relationships that exist between
individual attributes and the target outcome. Through this process, we seek to quantify the
degree and direction of the linear association between each continuous variable and the
target variable. A positive correlation implies that as the continuous variable increases, the
likelihood of a positive outcome in the target variable also increases, while a negative
correlation suggests an inverse relationship. This correlation analysis serves as a crucial
step in identifying influential features that may significantly impact the target variable. By
understanding the strength and nature of these relationships, we gain valuable insights into
the factors that play a role in determining the target outcome. These correlation coefficients
not only guide feature selection but also contribute to the overall interpretability and
predictive power of our model. This comprehensive examination of continuous variables in
relation to the target variable lays the groundwork for a nuanced understanding of the
underlying patterns within our dataset.
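A minimal sketch of this calculation with pandas is shown below, assuming Pearson correlation (the pandas default) and a label column named TARGET. The helper name `correlation_with_target` and the toy columns are illustrative assumptions.

```python
import pandas as pd


def correlation_with_target(df: pd.DataFrame, target: str = "TARGET") -> pd.Series:
    """Pearson correlation of every numeric column with the target, strongest first."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.drop(columns=[target]).corrwith(numeric[target])
    # Order by absolute strength while keeping the sign of each coefficient.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy data for illustration only.
toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 1, 1, 1],
    "rises_with_target": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "falls_with_target": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    "label_text": ["a", "b", "c", "d", "e", "f"],  # non-numeric, ignored
})
result = correlation_with_target(toy)
print(result)
```

Because TARGET is binary, this coefficient is the point-biserial correlation; it captures only linear association, so a value near zero does not rule out a non-linear relationship.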
5. Classification Model
In the pursuit of developing a robust and accurate predictive model, the preprocessed data
underwent training using four distinct classification models: Logistic Regression, K-Nearest
Neighbors, Decision Tree Classifier, and Gaussian Naive Bayes. Prior to model training, the
dataset was subjected to oversampling through the Synthetic Minority Over-sampling
Technique (SMOTE) following the data split, a measure aimed at addressing potential class
imbalances. Additionally, missing value imputation was carried out post-data split to ensure
the integrity of the training process. After comprehensive training and evaluation, it was
observed that the Decision Tree Classifier exhibited the highest test accuracy among the
models, reaching an impressive 0.82 accuracy. This outcome underscores the effectiveness
of the Decision Tree model in capturing complex relationships within the data and highlights
its potential as a strong candidate for further refinement and deployment in predictive tasks.
Model                      Train accuracy   Test accuracy
Logistic Regression        0.56             0.47
Gaussian Naive Bayes       0.61             0.60
K-Nearest Neighbors        0.89             0.67
Decision Tree Classifier   1.00             0.82
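The order of operations described in this section (split first, then impute and oversample) can be sketched with NumPy alone. The code below is a simplified re-implementation of the interpolation idea behind SMOTE, not the library implementation used in the project, and it assumes that imputation statistics and oversampling are derived from the training portion only. All function names and the toy data are hypothetical.

```python
import numpy as np


def impute_with_train_median(X_train, X_test):
    """Fill NaNs in both splits using medians computed on the training split only."""
    medians = np.nanmedian(X_train, axis=0)

    def fill(X):
        return np.where(np.isnan(X), medians, X)

    return fill(X_train), fill(X_test)


def smote_like_oversample(X, y, k=3, seed=0):
    """Add synthetic minority rows (label 1) until both classes have equal size.

    Each synthetic row is a random point on the segment between a minority row
    and one of its k nearest minority neighbours (the core SMOTE idea).
    """
    rng = np.random.default_rng(seed)
    minority = X[y == 1]
    n_new = int((y == 0).sum() - (y == 1).sum())
    if n_new <= 0 or len(minority) < 2:
        return X, y
    k = min(k, len(minority) - 1)
    # Pairwise distances among minority rows; a row is never its own neighbour.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(minority), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    synthetic = minority[base] + gap * (minority[picked] - minority[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.ones(n_new, dtype=y.dtype)])
    return X_out, y_out


# Toy data: 100 rows, 2 features, 7 positives (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = np.zeros(100, dtype=int)
y[::15] = 1                              # positives at rows 0, 15, ..., 90
X[rng.random(X.shape) < 0.1] = np.nan    # sprinkle some missing values

# Split first (first 80 rows for training), then impute and oversample.
X_train, X_test = impute_with_train_median(X[:80], X[80:])
y_train, y_test = y[:80], y[80:]
X_bal, y_bal = smote_like_oversample(X_train, y_train)

print("train class counts before:", np.bincount(y_train))
print("train class counts after: ", np.bincount(y_bal))
```

Under these assumptions the test rows are left untouched, so test accuracy is measured on the original class distribution.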
6. Conclusion
The pursuit of effective fraud detection involves navigating the intricate landscape of
imbalanced data, comprehensive feature selection, and rigorous model evaluation. In our
exploration of fraud detection using the "Credit Card Fraud Detection" dataset, several key
insights have emerged, shaping a comprehensive understanding of the factors influencing
fraudulent activities.
The initial acknowledgment of the imbalanced distribution of fraud and non-fraud instances
emphasizes the necessity for nuanced analytical approaches. With fraudulent transactions
constituting a mere 7% of the dataset, the challenge lies in developing models attuned to the
subtleties that characterize these rare occurrences.
A meticulous feature selection process was undertaken to refine the dataset, addressing
missing values, consolidating related columns, and pruning highly correlated attributes. This
strategic curation not only enhances the quality and interpretability of the dataset but also
lays a robust foundation for subsequent model training.
Our granular analysis of positive targets within categorical attributes provided a nuanced
view of how fraud likelihood varies across diverse subgroups. The findings unveiled geographic
discrepancies, family dynamics, socioeconomic influences, loan type disparities, gender
disparities, and various socio-demographic factors influencing fraud likelihood.
The exploration of continuous variables' correlation with the target variable further enriched
our understanding, highlighting influential features that contribute to the target outcome.
Factors such as academic background, employment status, and industry sectors
demonstrated varying degrees of impact on fraud likelihood.
The culmination of our efforts involved training four classification models, with the Decision
Tree Classifier emerging as the standout performer with an impressive test accuracy of 0.82.
The model's adeptness in capturing complex relationships within the data positions it as a
strong candidate for further refinement and deployment in practical fraud detection
scenarios.
In summary, our comprehensive approach, from data preprocessing to model evaluation,
has provided valuable insights into the dynamics of fraud within financial transactions. By
addressing the challenges posed by imbalanced data and employing sophisticated analytical
techniques, our findings pave the way for the development of robust fraud detection systems
capable of navigating the intricacies of real-world financial landscapes.