Catching Bad Guys with Graph Mining

Suspicious network patterns may be the key to detecting criminals and fraudsters on e-commerce sites.

By Polo Chau
DOI: 10.1145/1925041.1925044

The Internet opened a new operational channel for many services, like online auctions, shopping, and banking. Every day, millions of transactions happen over these services, collectively known as e-commerce, each in the blink of an eye. Unfortunately, the monetary incentives intrinsic to e-commerce attract the attention of criminals (the bad guys), leading to some new types of crime. For instance, multiple online identities are easy to create: a perpetrator could use his many alter egos to execute sophisticated schemes, burying his trail deep under false covers and evading traditional detection methods that only examine identities individually. Furthermore, e-commerce generates so much data that discovering the bad guys, or their alter egos, among the overwhelming amount of data seems daunting.

On the bright side, bad guys leave trails. If we look closely, sometimes we can identify their suspicious operation patterns. What are the patterns? How do we detect them? Perhaps these questions would be easier to answer if we viewed the world a little differently: as a giant graph (or network) of nodes (people) and edges (relationships among people). Detecting bad behavior then becomes a matter of locating suspicious patterns, that is, collections of incriminating relationships, in the graph. This process of locating useful information and patterns in graph data is called graph mining, and it has been successfully applied to many domains. Here we look at how it works in e-commerce to help catch the bad guys.

Detecting Fraud by Aggregating Incriminating Evidence

Online auctions like eBay are popular avenues for buying and selling almost any item imaginable. Most items will be delivered, but some unfortunately will not, because some sellers are crooks who never intend to deliver them.
How do they convince the buyers that they are legitimate? They game the reputation systems that most online auctions set up to help buyers gauge sellers’ trustworthiness.

One type of fraud scheme works as follows. The bad guy first creates multiple identities in the online auction, dividing them into two groups (“fraudsters” and “accomplices”). The fraudsters rarely trade among themselves, and neither do the accomplices. Then, the bad guy uses the accomplices to artificially boost the reputation of the fraudsters. The accomplices typically act like normal, honest users who buy and sell items (usually cheap, to lower operating costs), but they sometimes sell expensive items to the fraudsters, leaving glowing comments about how the buyers (fraudsters) are good guys (“paid on time”, “easy to communicate with”, etc.). After the fraudsters have built a high reputation, they launch deceptive auctions to sell expensive items (e.g., big-screen TVs), usually at bargain prices, to the victims (honest people). Those items will never be delivered.

We call the above interaction pattern a “bipartite core” (Figure 1), where two types of nodes (fraudsters and accomplices) only interact with nodes of the other type, but not with their own. This pattern forms the infrastructure that criminals set up before they carry out auction fraud. But to the naked eye, the associations between the identities involved in the deceptive bipartite core pattern might not be apparent.
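To make the bipartite core pattern concrete, here is a minimal sketch that measures how closely two candidate groups match it: many trades across the groups, few or none within each group. The function name `core_density` and the toy identities are hypothetical and chosen only for illustration. The sketch also assumes the two groups are already known, which is exactly the part a real detector has to infer.

```python
from itertools import combinations


def core_density(edges, group_a, group_b):
    """Return (cross, within) trade densities for two candidate groups.

    cross  = fraction of possible A-B pairs that have traded
    within = fraction of possible same-group pairs that have traded
    A bipartite core has cross close to 1 and within close to 0.
    """
    traded = {frozenset(e) for e in edges}
    cross_pairs = [(a, b) for a in group_a for b in group_b]
    within_pairs = list(combinations(group_a, 2)) + list(combinations(group_b, 2))
    cross = sum(frozenset(p) in traded for p in cross_pairs) / len(cross_pairs)
    within = 0.0
    if within_pairs:
        within = sum(frozenset(p) in traded for p in within_pairs) / len(within_pairs)
    return cross, within


# Hypothetical toy data: three fraudster identities, two accomplice identities.
fraudsters = ["f1", "f2", "f3"]
accomplices = ["a1", "a2"]
# f3 has not traded with a2, so this is a "near" core: 5 of 6 cross pairs.
edges = [("f1", "a1"), ("f1", "a2"), ("f2", "a1"), ("f2", "a2"), ("f3", "a1")]

cross, within = core_density(edges, fraudsters, accomplices)
print(f"cross-group density: {cross:.2f}, within-group density: {within:.2f}")
```

A high cross-group density combined with a within-group density near zero is the signature described above; honest users, who trade with all kinds of partners, would blur both numbers.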
The NetProbe system [1] was developed to dig out the identities in this pattern by automatically scanning connections between buyers and sellers, several layers deep, to look for artificial feedback, revealing identities and associations that match the bipartite core. The system ran through over one million transactions and correctly picked out dozens of previously identified criminals; it also identified tens of probable fraudsters and apparent accomplices.

Under the hood, NetProbe uses an inference algorithm called Belief Propagation to infer which nodes in the auction graph are most likely to be fraudsters and accomplices. The system first uses heuristics to assign each node a vector of three probabilities, called the node’s belief: a fraudster probability, an accomplice probability, and an honest probability. For example, if an identity has been active for many years and has not received any negative comments from other people, then that identity has a high honest probability; if an account was recently shut down right after it received many complaints, then it has a high fraudster probability. These three probabilities sum to 1.

Table 1. Conditional probability table describing a “bipartite core”; ε is a small constant close to zero. Rows give the node’s own state and columns give its neighbor’s state. For example, entry (F, A), with a value of 1-2ε, describes a very high probability of a node’s neighbor being an accomplice (A) given the node itself being a fraudster (F).

                    F          A          H
  Fraudster (F)     ε          1-2ε       ε
  Accomplice (A)    0.5        2ε         0.5-2ε
  Honest (H)        ε          (1-ε)/2    (1-ε)/2

NetProbe’s algorithm then uses the matrix in Table 1 to transform each node’s belief into a message (also a probability vector) that the node will send to each of its neighbors; the message represents what the node thinks about its neighbors. The transformation is similar to multiplying the node’s belief with the matrix.
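The belief-to-message transformation can be sketched in a few lines. This is an illustration of the simplified description given here, not NetProbe's actual code: the value of `EPS`, the name `message`, and the example belief are assumptions, and the row/column layout is our reading of Table 1 (rows are the sender's state, columns the neighbor's state, in the order F, A, H).

```python
EPS = 0.05  # assumed value; the article only says epsilon is "close to zero"

STATES = ("F", "A", "H")  # fraudster, accomplice, honest

# Table 1: PROPAGATION[i][j] = P(neighbor is in state j | node is in state i)
PROPAGATION = [
    [EPS, 1 - 2 * EPS, EPS],              # node is a fraudster
    [0.5, 2 * EPS, 0.5 - 2 * EPS],        # node is an accomplice
    [EPS, (1 - EPS) / 2, (1 - EPS) / 2],  # node is honest
]


def message(belief):
    """Turn a node's belief into a message about its neighbors.

    This is a vector-matrix product (belief times the Table 1 matrix),
    normalized so that the message is itself a probability vector.
    """
    raw = [
        sum(belief[i] * PROPAGATION[i][j] for i in range(len(STATES)))
        for j in range(len(STATES))
    ]
    total = sum(raw)
    return [value / total for value in raw]


# Hypothetical node that the heuristics consider a likely fraudster.
likely_fraudster = [0.9, 0.05, 0.05]
msg = message(likely_fraudster)
for state, prob in zip(STATES, msg):
    print(f"neighbor is {state}: {prob:.3f}")
```

Because every row of the matrix sums to 1, the product of a probability vector with the matrix already sums to 1; the explicit normalization is kept only to guard against rounding drift.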
For example, if a node has high fraudster probability, then applying the transformation to it will create a message for each neighbor that says the neighbor is likely an accomplice. All nodes simultaneously send out messages to their neighbors. Each node gathers its incoming messages, multiplies them into one vector (which also resolves competing messages, much like majority voting), then sets that vector as the node’s new belief. Finally, the node generates new messages for its neighbors using its updated belief. This whole process continues until the node beliefs stop changing. NetProbe then calls out the likely fraudsters and accomplices, and warns off potential bidders.

The idea of propagating information across a graph and aggregating it to produce high-level conclusions is powerful. It inspired the creation of the generalized Snare system [2], which is applicable to various kinds of fraud and anomaly detection tasks. Snare was used on general ledger data (a network of interconnected accounts) to detect financial fraud, boosting the detection rates of misstated accounts by 5.5 times.

User-Centered and Automatic Pattern Detection

Sometimes, analysts need to experiment with multiple patterns in the hope that some will match the actual incriminating patterns. Creating a separate algorithm for each such pattern is costly and time-consuming, especially since most patterns will end up not being useful. Can we provide one tool that detects a wide range of patterns quickly and easily? The Graphite system [3] aims to meet this challenge. It provides a direct-manipulation user interface that lets the user construct the query pattern by placing nodes on the screen, assigning types to them, and connecting the nodes with edges. For example, the query pattern in Figure 2 asks for money laundering rings of alternating businessmen and bankers. Graphite then locates the pattern’s exact and approximate matches in a large graph of the user’s choosing.
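To illustrate what a typed query pattern means, the sketch below finds exact matches of an alternating businessman-banker ring by brute force. It is not Graphite's algorithm, which the text here does not describe and which also returns approximate matches; the function `find_typed_rings`, the four-node ring size, and the toy graph are all assumptions made for the example.

```python
from itertools import permutations


def find_typed_rings(node_types, edges, ring_types):
    """Find exact matches of a typed ring pattern by brute force.

    node_types: dict mapping node name -> type label
    edges:      iterable of undirected (u, v) pairs
    ring_types: type labels around the ring, in order
    Returns a set of frozensets, one per matching group of nodes.
    Only practical for toy graphs: it tries every ordered selection.
    """
    linked = {frozenset(e) for e in edges}
    k = len(ring_types)
    found = set()
    for combo in permutations(node_types, k):
        # Node types must match the pattern position by position.
        if any(node_types[n] != t for n, t in zip(combo, ring_types)):
            continue
        # Consecutive nodes (wrapping around) must be connected.
        if all(frozenset((combo[i], combo[(i + 1) % k])) in linked for i in range(k)):
            found.add(frozenset(combo))
    return found


# Hypothetical toy graph of money transfers.
node_types = {
    "ann": "Businessman",
    "bob": "Banker",
    "cat": "Businessman",
    "dan": "Banker",
    "eve": "Banker",
}
edges = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("dan", "ann"), ("ann", "eve")]

query = ["Businessman", "Banker", "Businessman", "Banker"]
rings = find_typed_rings(node_types, edges, query)
print(rings)
```

Here only ann, bob, cat, and dan form the ring; eve is a banker but is linked to just one businessman. Dropping the type check would turn this into a purely structural search, which returns every four-node ring regardless of who is in it.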
Graphite improves on existing algorithms that detect only structural patterns without considering the types of the nodes that compose the patterns; it enables more specific patterns to be found. Consider a communication network where each node is a person from a country (country is the node type). Our analyst Laura wants to locate four collaborators who are from Japan, Italy, Canada, and Greece respectively, and she believes they likely form a clique (i.e., every pair has communicated). With Graphite, Laura sketches a 4-node clique as the query pattern and assigns the countries as the node types. But if she were to use another tool where the node types cannot be specified, every 4-node clique would be returned (like a family of four who all reside in the US), overwhelming Laura with irrelevant information.

The Holy Grail of anomaly detection is for detection to happen automatically, with no human effort at all. While this may seem to be a distant goal, the Oddball system [4] is a big step towards it. The main idea behind Oddball is that it extracts a set of features (without human intervention) that summarize each node’s neighborhood subgraph, called the node’s “egonet,” which includes the node’s immediate neighbors and all edges in the neighborhood. Then, Oddball uses unsupervised methods that automatically correlate pairs of features and pinpoint nodes whose features significantly deviate from those of the rest of the nodes. Oddball can detect several important patterns, such as near-cliques and near-stars (by correlating the total edge weight and total edge count in the egonet). For example, in a who-called-whom network, the center of a near-star could be a telemarketer who has called many random people, and a near-clique could be a close-knit group of friends.

E-commerce has redefined crime.
We now see new breeds of online crime where technologically savvy criminals exploit not only the weaknesses of human nature, but also the systems originally designed to protect online shoppers. Many criminals have learned to cover their tracks with the large amount of data generated by e-commerce, and to confuse law enforcement with multiple fake virtual identities. As e-commerce thrives and the online world becomes even more connected, tools and methods such as those from graph mining will play an increasingly important role in untangling the many layers of sophisticated organization and schemes crafted by criminals. Will online crime be eliminated? Perhaps not. But our effort will force crooks to resort to more complex schemes that incur more effort and higher cost, so crime will be increasingly difficult to commit. Then, perhaps, fewer bad guys will attempt to get on the wrong side of the law.

Figure 1. A “near bipartite core” of fraudsters and accomplices. Honest identities are not shown. Note that each fraudster has traded with most, but not all, accomplices; hence it is a near, but incomplete, core. (Node groups: Fraudsters, Accomplices.)

Figure 2. Given a query pattern, such as a money laundering ring (left), the Graphite system can find both exact and near matches that tolerate a few extra nodes (right). (Panels: Query Pattern, Near Match. Node types: Banker, Businessman.)

Biography

Polo Chau is a Ph.D. student in the Machine Learning Department at Carnegie Mellon University. His research intersects graph mining and human-computer interaction. He builds interactive systems that help analysts explore and make sense of large graph data, find patterns, detect fraud, and spot anomalies. His work on fraud detection in online auctions appeared in the Wall Street Journal and many other media outlets. He was a Symantec fellow for two consecutive years, and is an avid designer, having won many awards.

References

1. Pandit, S., Chau, D.H., Wang, S., and Faloutsos, C. NetProbe: A fast and scalable system for fraud detection in online auction networks. In Proc. WWW 2007, 201-210.
2. McGlohon, M., Bay, S., Anderle, M., Steier, D., and Faloutsos, C. SNARE: A link analytic system for graph labeling and risk detection. In Proc. KDD 2009, 1265-1274.
3. Chau, D.H., Faloutsos, C., Tong, H., Hong, J.I., Gallagher, B., and Eliassi-Rad, T. GRAPHITE: A visual query system for large graphs. In Proc. ICDM 2008, 963-966.
4. Akoglu, L., McGlohon, M., and Faloutsos, C. OddBall: Spotting anomalies in weighted graphs. In Proc. PAKDD 2010, 410-421.

© 2011 ACM 1528-4972/11/0300 $10.00
XRDS • Spring 2011 • Vol. 17 • No. 3