IDENTIFICATION OF FAKE BANKNOTES USING K-MEANS CLUSTERING
Andy Zhang

1. THE PURPOSE OF THIS PROJECT

This project analyzes banknotes by comparing their image features with those of real notes, and then decides whether each note is real or fake. To do this, K-means clustering (an unsupervised machine learning algorithm that groups data points around cluster centers) is applied to classify the banknotes. Since the algorithm offers good performance and reliability for this kind of task, we hope to bring a more accurate and efficient way of telling genuine banknotes from forged ones.

2. A BRIEF OVERVIEW OF THE PROCEDURE

First, features are extracted from the images of all the notes (including fake ones) using a wavelet transform tool, and the features are converted into numbers. They are then compared with the ones converted from images of real notes. We did not do this ourselves; instead, all of the data were downloaded from OpenML.

Then, some data processing is performed, which eliminates quite a number of outliers (values far outside the typical measuring range). This ensures that potentially incorrect data cannot interfere with the results. This is necessary because the K-means algorithm is sensitive to outliers, and too many outliers may cause the algorithm to find incorrect cluster centers.

Finally, the K-means clustering algorithm is applied to the converted data and two classes (clusters) are formed. The first class contains all the real notes, while the second one contains all the forged ones. Therefore, one can tell whether a specific note has been forged simply by looking up which cluster it belongs to in the results of this project. The algorithm also achieves high accuracy. This means that one does not need to run the note through complicated analysis machines, as is traditionally done. Hence, the accuracy of fake detection can be improved while the time and effort needed are reduced. This is beneficial to banks and financial institutions.
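The sensitivity to outliers mentioned in this section follows from the fact that each cluster center is the mean of the points assigned to it. The sketch below is only an illustration: the coordinates are made up (they are not values from the OpenML dataset) and the helper name `centroid` is hypothetical. It shows how a single extreme point can pull a center far away from where most of the points lie.

```python
from statistics import mean

def centroid(points):
    """Return the mean position of a list of (x, y) points."""
    return tuple(mean(axis) for axis in zip(*points))

# Made-up cluster of four nearby points plus one extreme point.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
outlier = (42.0, 42.0)

clean_center = centroid(cluster)               # center of the four nearby points
skewed_center = centroid(cluster + [outlier])  # center after adding the outlier

print("without outlier:", clean_center)
print("with outlier:   ", skewed_center)
```

With the outlier included, the center lands outside the square that contains the four original points, which is why such values are removed before clustering.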
One can also add more entries to the dataset, and the method can still give an accurate identification. This allows it to remain useful over time. Other currencies will also work, as long as someone can construct a dataset containing the corresponding converted features.

3. THE ORIGINAL DATA

The dataset for this project is from OpenML. It summarizes four factors for each banknote image, namely variance, skewness, kurtosis and entropy. There are 1372 entries in total, each from either a real or a fake banknote, and all of the images used were captured with a high-definition (400x400 pixels) industrial camera specially designed for print inspection.

In this project, only the variance and skewness are considered. This is because:

A. It is impossible to show all four factors in one graph. The maximum possible number of dimensions is 3, but 3-D graphs are hard to show in this report, and are equally hard to read if they are shown.
B. Although the K-means clustering algorithm theoretically works on data of any dimension, it does not perform well on higher-dimensional data in practice.
C. With more dimensions, the algorithm requires much more computational power to execute, as it involves calculating distances.
D. As the number of dimensions rises, the data points become sparser and the results can become unreliable and inaccurate. Even if the algorithm did work well, it would be impossible to show the results anyway, as mentioned in A.

In this report, variance and skewness are also referred to as “V1” and “V2” respectively.

4. PROCESSING THE ORIGINAL DATA

The data is processed to remove outliers. We used the Z-score method, which removes all values that are more than 2 standard deviations away from the mean. Using 2 standard deviations is a common practice in statistics.

Figure 1: Scatter of variance and skewness after processing.

A scatter plot of these two factors is shown in Figure 1. The transparency has been adjusted so that the clusters can be shown more clearly.
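The Z-score filtering described in this section can be sketched roughly as follows. This is not the code used in the project: the function name `remove_outliers`, the use of the population standard deviation, the choice to drop a whole row when either feature exceeds the threshold, and the sample rows are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def remove_outliers(rows, threshold=2.0):
    """Drop every row in which any feature is more than `threshold`
    standard deviations away from that feature's mean."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # assumption: population std dev
    kept = []
    for row in rows:
        z_scores = [abs(x - m) / s if s > 0 else 0.0
                    for x, m, s in zip(row, means, stds)]
        if all(z <= threshold for z in z_scores):
            kept.append(row)
    return kept

# Made-up (V1, V2) rows; the last one is an obvious outlier in V1.
rows = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (1.0, 2.0), (0.0, 1.0),
        (1.0, 2.0), (2.0, 3.0), (1.0, 2.0), (0.0, 1.0), (100.0, 2.0)]
cleaned = remove_outliers(rows)
print(len(rows), "->", len(cleaned))  # prints: 10 -> 9
```

Note that the mean and standard deviation are computed once, before any row is dropped, so the outlier itself inflates the standard deviation it is measured against. Whether the project filtered in a single pass or repeatedly is not stated in this report.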
As Figure 1 shows, there are roughly two areas where the points are densest, called “clusters”. However, they are quite hard to tell apart with the naked eye. Therefore, the K-means algorithm has to be used to separate them.

5. CLUSTERING USING THE K-MEANS CLUSTERING ALGORITHM

The K-means algorithm is applied to form two cluster centers, so that the two clusters formed can be clearly observed.

First, the algorithm generates two cluster centers, which indicate the areas where the data points are densest. A graph of those points is shown in Figure 2. As you can see, the centers are on either side of the graph, one on the left and one on the right. The left (orange) one is near the origin, indicating real notes, with minimal absolute variance and skewness. The right (green) one, however, is further from the origin, which indicates fake notes.

Then, we can also use the algorithm to form two classes, one containing the real notes and the other the fake notes. In this report, this step is called “classification”. In Figure 3, the graph shows the two classes from the classification in different colors.

Finally, the classification step is run repeatedly until the cluster centers no longer move. This is because K-means is an iterative machine learning algorithm that can be easily affected by many factors, such as the initialization. Even very small changes can affect it considerably. Therefore, running it repeatedly ensures that the results are not easily affected by such factors, and that they are stable and reliable.

Figure 2: The center points are plotted

6. INTERPRETATION OF THE RESULTS AND RECOMMENDATIONS

After clustering, it becomes much easier to find the two classes than in Figure 2. In Figure 3, the blue cluster on the left (labeled “Cluster 1”) represents the data points of all the real notes. The orange cluster on the right (labeled “Cluster 2”) represents the data points of all the fake notes.
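The procedure from Section 5 (assign each point to its nearest center, move each center to the mean of its points, and repeat until the centers stop moving) can be sketched in plain Python as below. This is a minimal illustration rather than the project's implementation. The toy points and the hand-picked starting centers are assumptions chosen so that the run is reproducible; they are not taken from the banknote dataset, and the report does not state how the centers were initialized.

```python
from math import dist

def k_means(points, centers, max_iter=100):
    """Alternate assignment and center update until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for each point.
        labels = [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
                  for p in points]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its old center
        if new_centers == centers:
            break  # converged: no center moved in this pass
        centers = new_centers
    return centers, labels

# Made-up 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centers, labels = k_means(points, centers=[points[0], points[-1]])
print(centers)  # [(0.5, 0.5), (10.5, 10.5)]
print(labels)   # [0, 0, 0, 0, 1, 1, 1, 1]
```

Different starting centers can lead to a different final grouping on the same data, which is the dependence on initialization discussed in Section 5.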
All of the points in each class lie around their respective cluster center. According to the graph, when the skewness is below 0, a variance below 5 indicates a real note; otherwise it indicates a fake note. When the skewness is above 0, a variance below 0 indicates a real note; otherwise it indicates a fake note. These thresholds are recommended for inspecting banknotes.

Figure 3: The classification results

7. LIMITATIONS

This project has some limitations, mostly due to the K-means algorithm and the dataset.

A. The dataset has a limited number of entries. This means that if anyone comes up with a new way of forgery, it may not be in the dataset yet. Therefore, this project may fail to classify such a new kind of note correctly as real or fake.
B. As mentioned in Section 3, the K-means algorithm has multiple limitations. The dataset has four factors, but due to these limitations it is nearly impossible to use them all. This might lead to some problems with reliability.
C. The K-means algorithm assumes that each cluster is roughly circular, so the shape of the clusters may be wrong. This also means that the classes might be wrong as well.
D. Two clusters may not be the most suitable number for this project. The elbow method determines an appropriate number of clusters from the point where the plot of clustering error against the number of clusters bends (the “elbow”). According to the results, the “elbow” bends between 2 and 4. However, due to the length limit of this report, separate clusterings with 3 and 4 clusters cannot be shown.

Figure 4: The “elbow” is between 2 and 4
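The elbow method mentioned in limitation D can be sketched as follows. This is only an illustration under stated assumptions: the toy points are made up, the evenly spaced choice of starting centers is a hypothetical initialization scheme, and the error measure used here (the within-cluster sum of squared distances) is the usual choice for an elbow plot but is not confirmed by this report. The numbers it produces say nothing about the banknote dataset itself.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iter=100):
    # Hypothetical initialization: k evenly spaced points in input order.
    step = len(points) // k
    centers = [points[i * step] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
                if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

def wcss(points, centers, labels):
    """Within-cluster sum of squared distances to the assigned centers."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

# Made-up 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
errors = {k: wcss(points, *k_means(points, k)) for k in range(1, 5)}
print(errors)  # {1: 404.0, 2: 4.0, 3: 3.0, 4: 2.0}
```

On this toy data the error drops sharply from 1 to 2 clusters and then flattens, so the bend is at 2. On real data the bend is often less distinct, which is consistent with the range of 2 to 4 reported in Figure 4.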