IDENTIFICATION OF FAKE BANKNOTES
USING K-MEANS CLUSTERING
Andy Zhang
1. THE PURPOSE OF THIS PROJECT
This project analyzes banknotes by comparing them with real notes in order to tell whether a given
note is real or fake. To do this, K-means clustering (an unsupervised machine learning algorithm
that finds cluster centers) is applied to classify the banknotes. Because the algorithm is both fast
and, on this dataset, reliable, we hope to offer a more accurate and efficient way of identifying
genuine and forged banknotes.
2. A BRIEF OVERVIEW OF THE PROCEDURE
First, features are extracted from the images of all the notes (including fake ones) using a wavelet
transform tool, and the features are converted into numbers. They are then compared with the ones
converted from images of real notes. We did not do this ourselves; instead, all of the data were
downloaded from OpenML.
Then, some data processing is performed, which eliminates quite a number of outliers (values far
outside the typical measuring range). This ensures that potentially incorrect data cannot interfere
with the results. This is necessary because the K-means algorithm is sensitive to outliers, and too
many outliers may cause the algorithm to find incorrect cluster centers.
Finally, the K-means clustering algorithm is applied to the converted data and two classes (clusters)
are formed. The first class contains all the real notes, while the second one contains all forged ones.
Therefore, by looking up a specific note in the results of this project, one can tell whether it has been
forged simply by checking which cluster it belongs to. The algorithm also achieves a high accuracy.
This means that one need not run the note through complicated analysis machines, as people
traditionally do. Hence, the accuracy of fake detection can be improved while the time and effort
needed are reduced. This is beneficial to banks and financial institutions.
One can also add more entries to the dataset and still obtain an accurate identification. This allows
the method to keep working over time. Other currencies will also work, as long as someone can
construct a dataset containing all the converted differences.
3. THE ORIGINAL DATA
The dataset for this project is from OpenML. It summarizes four factors of the differences,
namely variance, skewness, kurtosis and entropy. There are 1372 entries in total, each a real or fake
banknote, and all of the images were captured with a high-definition (400x400 pixels) industrial
camera specially designed for print inspection.
In this project, only the variance and skewness are considered. This is because:
A. It is impossible to show all four factors in one graph. The maximum possible number of dimensions
is 3, but 3-D graphs are hard to show in this report and equally hard to read if they are shown.
B. Although the K-means clustering algorithm theoretically works on data of any dimension, it does
not perform well on higher-dimensional data in practice.
C. With more dimensions, the algorithm requires much more computational power to execute, as it
involves calculating distances.
D. As the dimensions rise, the data points become more sparse and the results can be very unreliable
and inaccurate. Even if the results were reliable, it would be impossible to show them anyway, as
mentioned in A.
In this report, variance and skewness are also referred to as “V1” and “V2” respectively.
4. PROCESSING THE ORIGINAL DATA
The data is processed to remove outliers. We used the Z-score method, which removes all values
that are more than 2 standard deviations away from the mean. Using 2 standard deviations is a
common practice in statistics.
A scatter plot of these two factors is shown in Figure 1. The transparency has been adjusted so that
the clusters can be shown more clearly. As you can see, the graph roughly shows two areas with the
densest points, called “clusters”. However, it is quite hard to tell them apart with the naked eye.
Therefore, the K-means algorithm has to be used to separate them.
Figure 1: Scatter of variance and skewness after processing.
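The Z-score filtering described in this section can be sketched as follows. This is a minimal illustration, not the code used for this project: the function name remove_outliers, the sample values, and the use of the population standard deviation are assumptions, as is the rule that a row is dropped when any one of its features lies more than 2 standard deviations from that feature's mean.

```python
from statistics import mean, pstdev


def remove_outliers(rows, threshold=2.0):
    """Keep rows whose every feature is within `threshold` standard
    deviations of that feature's mean (hypothetical helper)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) for col in columns]  # assumption: population std
    kept = []
    for row in rows:
        z_scores = [
            (value - m) / s if s > 0 else 0.0
            for value, m, s in zip(row, means, stdevs)
        ]
        if all(abs(z) <= threshold for z in z_scores):
            kept.append(row)
    return kept


# Made-up (V1, V2) pairs; the last one is an obvious outlier.
sample = [(0.1, 0.2), (0.0, -0.1), (0.2, 0.1), (-0.1, 0.0),
          (0.1, -0.2), (-0.2, 0.1), (0.0, 0.0), (0.1, 0.1),
          (-0.1, -0.1), (9.0, 9.0)]
cleaned = remove_outliers(sample)
```

In this made-up sample only the pair (9.0, 9.0) is removed, since it is the only row that lies more than 2 standard deviations from the mean.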
5. CLUSTERING USING THE K-MEANS CLUSTERING ALGORITHM
The K-means algorithm is applied to form two cluster centers, so that the two clusters formed can be
clearly observed.
First, the algorithm generates two cluster centers, which indicate the areas where the data points are
densest. A graph of those points is shown in Figure 2. As you can see, the centers are on either side of
the graph, one left, one right. The left (orange) one is near the origin, indicating real notes, with
minimal absolute variance and skewness. The right (green) one, however, is further from the origin,
which indicates fake notes.
Then, we can also use the algorithm to form two classes, one containing the real notes and the other
the fake notes. This is called “classification”. In Figure 3, the graph shows the two classes from the
classification in different colors.
Finally, the classification part is run several times, until the cluster centers do not move at all. This
is because the K-means algorithm is, after all, a machine learning algorithm, and it can be easily
affected by many factors, such as the initialization data. Even very small changes can mean a lot to
it. Therefore, running it repeatedly ensures that the results are not easily affected by the other
factors. It also makes sure that the result is stable and reliable.
Figure 2: The center points are plotted
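The procedure in this section (generate two centers, assign every point to its nearest center, and repeat until the centers do not move) can be sketched in plain Python as below. This is only an illustrative sketch under stated assumptions, not the implementation used for this project: the function name kmeans, the random choice of initial centers, the fixed seed, and the sample points are all hypothetical.

```python
import math
import random


def kmeans(points, k=2, seed=0, max_iter=100):
    """Basic K-means: alternate assignment and center update
    until no center moves."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: random initial centers
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:  # the centers did not move at all
            break
        centers = new_centers
    return centers, labels


# Made-up (V1, V2) points forming two well-separated groups.
points = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
centers, labels = kmeans(points)
```

Changing the seed changes the initial centers, which is one way to check how sensitive the result is to initialization.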
6. INTERPRETATION OF THE RESULTS AND RECOMMENDATIONS
After clustering, it becomes much easier to find the two classes than in Figure 2. In Figure 3, the
blue cluster on the left (labeled “Cluster 1”) represents the data points of all the real notes. The
orange cluster on the right (labeled “Cluster 2”) represents the data points of all the fake notes. All
of the points in each class lie around their respective cluster centers.
According to the graph, when the skewness is below 0, a variance below 5 means a real note;
otherwise the note is fake. When the skewness is above 0, a variance below 0 means a real note;
otherwise the note is fake. We recommend using these thresholds to inspect banknotes.
Figure 3: The classification results
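The thresholds read off the graph in this section can be written as a small rule. This is a sketch only: the function name classify_note is hypothetical, and because the report does not say how a skewness of exactly 0 should be treated, the code groups it with the "above 0" case as an assumption.

```python
def classify_note(variance, skewness):
    """Apply the thresholds stated in this section.

    Returns "real" or "fake".
    """
    # Assumption: a skewness of exactly 0 is handled like skewness above 0.
    variance_limit = 5.0 if skewness < 0 else 0.0
    return "real" if variance < variance_limit else "fake"


# Made-up example values for illustration.
example_low_skew = classify_note(variance=3.0, skewness=-2.0)
example_high_skew = classify_note(variance=1.0, skewness=4.0)
```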
7. LIMITATIONS
This project has some limitations, mostly due to the K-means algorithm and the dataset.
A. The dataset has a limited number of entries. This means that if anyone comes up with a new way
of forgery, it may not be in the dataset yet. This can cause the project to fail to classify this new
kind of fake note correctly.
B. As mentioned in Section 3, the K-means algorithm has multiple limitations. There are four
factors, but due to these limitations, it is nearly impossible to use them all. This might lead to some
problems with reliability.
C. The K-means algorithm assumes that each cluster is circular; therefore, the shape of the clusters
may be wrong. This also means that the classes might be wrong as well.
D. Two clusters may not be the right number for this project. The elbow method determines the
appropriate number of clusters as the number at the “elbow” of the plot. According to the results
plotted in Figure 4, the “elbow” bends between 2 and 4. However, due to the length limit of this
report, separate clusterings with 3 and 4 clusters cannot be shown.
Figure 4: The “elbow” is between 2 and 4
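The elbow method mentioned in D compares, for each candidate number of clusters k, the total squared distance from every point to its nearest cluster center. A minimal sketch of that calculation is given below. It is not the code behind Figure 4: the helper kmeans_inertia, the random initialization with a fixed seed, and the sample points are assumptions made for illustration.

```python
import math
import random


def kmeans_inertia(points, k, seed=0, max_iter=100):
    """Run a basic K-means and return the within-cluster sum of
    squared distances (hypothetical helper)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: random initial centers
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        new_centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)


# Made-up (V1, V2) points forming two well-separated groups.
points = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
inertias = {k: kmeans_inertia(points, k) for k in range(1, 5)}
```

Plotting these values against k and looking for the point where the curve stops dropping sharply gives the elbow plot. For this made-up sample the sharp drop is from k = 1 to k = 2.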