Christopher Gates, Ninghui Li, Zenglin Xu
Purdue University

Suresh N. Chari, Ian Molloy, Youngja Park
IBM TJ Watson Research

• Intellectual property theft is an important security problem
• Insider threat
  ◦ Legitimate access
  ◦ In-depth knowledge of resources
  ◦ Knowledge of deployed security mechanisms
• Stolen credentials
  ◦ Can utilize another person's legitimate access

• Limit exposure via access control
  ◦ Users need access
  ◦ Productivity is often seen as more important
• Encrypt data at rest
  ◦ Does not stop legitimate access
• Use high-level statistics for detection
  ◦ Does not capture finer-grained detail
  ◦ Does not give specific guidance about the violation

• Exploit knowledge about resources to detect deviation from access history
• Can also be viewed as estimating/controlling the risk of aggregated accesses by one user
• Two kinds of malicious insiders
  ◦ Impetuous
  ◦ Patient

• Generate a score for a set of accesses given a history
• Score between two files: $\mathrm{score}(f, g)$
• Score of a file related to all history: $\mathrm{score}[f \mid A_H] = \mathrm{agg}_{g \in A_H}[\mathrm{score}(f, g)]$
• All files in the current period: $\mathrm{sumScore} = \sum_{k=1}^{M} \mathrm{score}[f_k \mid A_H]$
• Normalize: $\mathrm{aveScore} = \mathrm{sumScore} / M$

• Files are not accessed randomly within a hierarchy
• There are reasons to access specific areas
  ◦ Job function
  ◦ Project
  ◦ Related content
• Similarity can also have many facets
  ◦ Distance
  ◦ Access similarity
  ◦ File type/content

Name                            Formula
Binary
Full Distance
Lowest Common Ancestor (LCA)
Log LCA
Access Similarity

• 3 aggregation functions: $\mathrm{score}[f \mid A_H] = \mathrm{agg}_{g \in A_H}[\mathrm{score}(f, g)]$
• Relates $f$ to all files $g \in A_H$ in the history
  ◦ min: the lowest $\mathrm{score}(f, g)$
  ◦ ave: the average of all $\mathrm{score}(f, g)$
  ◦ k-nearest: compares to the $k$ lowest $\mathrm{score}(f, g)$

• CMVC Source Code Management System
  ◦ Log data: [user, timestamp, action, resource]
• For evaluation we used 1 year of log data
  ◦ ~512k unique files
  ◦ ~133k unique directories
  ◦ ~2k users
  ◦ 1 period to bootstrap, 10 to train, 1 to test
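The scoring scheme above (a pairwise score, an aggregation over the history $A_H$, then sumScore and aveScore over the current period) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the slides' similarity formulas did not survive, so `path_distance` is a hypothetical stand-in that counts directory-tree edges through the lowest common ancestor, and the k-nearest variant is assumed to average the k lowest scores. Lower scores mean "closer to the history", matching the "min: the lowest score" convention.

```python
from pathlib import PurePosixPath


def path_distance(f: str, g: str) -> int:
    """Hypothetical pairwise score: tree edges from f up to the lowest
    common ancestor and back down to g (0 means the same file)."""
    a, b = PurePosixPath(f).parts, PurePosixPath(g).parts
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def score_against_history(f, history, agg="min", k=3, score=path_distance):
    """score[f | A_H]: aggregate score(f, g) over every g in the history."""
    scores = sorted(score(f, g) for g in history)
    if not scores:
        raise ValueError("history must contain at least one file")
    if agg == "min":
        return scores[0]
    if agg == "ave":
        return sum(scores) / len(scores)
    if agg == "k-nearest":
        # Assumption: average the k lowest pairwise scores.
        nearest = scores[:k]
        return sum(nearest) / len(nearest)
    raise ValueError(f"unknown aggregation: {agg}")


def sum_score(current, history, **kwargs):
    """sumScore: total of score[f_k | A_H] over the current period's files."""
    return sum(score_against_history(f, history, **kwargs) for f in current)


def ave_score(current, history, **kwargs):
    """aveScore: sumScore normalized by the number of files M."""
    return sum_score(current, history, **kwargs) / len(current)


history = ["/src/a/x.c", "/src/a/y.c", "/src/b/z.c"]
current = ["/src/a/w.c", "/docs/r.txt"]
print(sum_score(current, history), ave_score(current, history))
```

In this toy example the file under `/docs` is far from everything in the history, so it dominates the period's score, which is the kind of deviation the scheme is meant to surface.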
• Check a user's current access against their own history
• Simple
• Easy to understand
• Detects deviations from past behavior

• This can catch an impetuous attacker
• A patient adversary can seed file accesses in previous time periods to affect the similarity of distance-based scores

• Scoring each user's accesses against every user's history gives a relation of expected behavior across all profiles
• A malicious user can only affect their own history

user1 | u1Score  u2Score  …  uNScore
user2 | u1Score  u2Score  …  uNScore
…     | u1Score  u2Score  …  uNScore
userN | u1Score  u2Score  …  uNScore

Features and descriptions
• Unique File Count: main technique currently used in practice
• New Unique File Count: Binary method, new unique files in the window
• Average Similarity Score: Log LCA self-score values, in [0, 1]
• Sum Similarity Score: Log LCA sum-score values
• Mean Distance
  - Find a single point that summarizes the previous periods over the user-to-user similarity features.
  - Use cosine similarity to find the distance between the current point and the expected point.
• Mean Distance * New Unique: since the goal is to detect theft of files, and mean distance has no feature representing the number of files accessed, we multiply the mean distance by the number of new unique files.
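The Mean Distance and Mean Distance * New Unique features can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: each period is assumed to yield one feature vector per user (for example the user's row u1Score … uNScore), the "single point" is assumed to be the component-wise mean of the previous periods, and the distance is taken as one minus cosine similarity. All function names are hypothetical.

```python
import math


def mean_point(previous):
    """Component-wise mean of the previous periods' feature vectors
    (assumed form of the 'single point' summarizing past behavior)."""
    n = len(previous)
    return [sum(column) / n for column in zip(*previous)]


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def mean_distance(current, previous):
    """Distance between the current period's point and the expected point."""
    return cosine_distance(current, mean_point(previous))


def mean_distance_times_new_unique(current, previous, new_unique_count):
    """Scale the distance by the number of new unique files accessed."""
    return mean_distance(current, previous) * new_unique_count


previous_periods = [[1.0, 0.0], [1.0, 0.0]]
print(mean_distance([1.0, 0.0], previous_periods))
print(mean_distance_times_new_unique([0.0, 1.0], previous_periods, 10))
```

Multiplying by the new-unique count means a user whose profile shifts direction while also touching many new files ranks above one who shifts direction on only a handful of files.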
• No ground truth data for malicious behavior
• Generate simulated attacks by injecting directories
  ◦ Represents targeted attacks on specific data
• Three size ranges for the injection
  ◦ 500-1000: 10 unique attacks
  ◦ 1000-2500: 12 unique attacks
  ◦ 5000+: 2 unique attacks
• Inject in two ways
  ◦ Impetuous attacker: inject X accesses in the current period
  ◦ Patient attacker: seed the current user's history with files from the injection, then inject

• Similarity scores may help in communicating events
  ◦ Better detection of truly anomalous activity
    - Go beyond simple file counts
    - Create a ranking of the most anomalous users
  ◦ Better understanding of what is causing the score
    - Rank the files that a user is accessing
    - Allows an incident response team to understand more quickly why a user received a high score

• Explored using file similarity features to identify malicious insiders
• Evaluated with real access logs and synthetic attacks
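The two injection modes used for the synthetic attacks can be sketched as follows. This is a hedged illustration only: the slides do not say how many files the patient attacker seeds or into which periods, so `seed_fraction` and the random choice of period are assumptions, and the function names are hypothetical.

```python
import random


def inject_impetuous(current, attack_files):
    """Impetuous attacker: all attack files appear in the current period."""
    return set(current) | set(attack_files)


def inject_patient(history, current, attack_files, seed_fraction=0.1, rng=None):
    """Patient attacker: first seed earlier periods with a few attack files,
    then inject the full attack into the current period.

    history is a list of per-period file sets; it is copied, not modified.
    seed_fraction and the random period choice are illustrative assumptions.
    """
    rng = rng or random.Random(0)
    attack = sorted(attack_files)
    if not attack or not history:
        raise ValueError("need at least one attack file and one past period")
    n_seed = max(1, int(len(attack) * seed_fraction))
    seeds = rng.sample(attack, n_seed)
    seeded_history = [set(period) for period in history]
    for f in seeds:
        rng.choice(seeded_history).add(f)
    return seeded_history, set(current) | set(attack)


history = [{"/a/1"}, {"/a/2"}]
current = {"/a/3"}
attack = {f"/secret/{i}" for i in range(20)}

impetuous_period = inject_impetuous(current, attack)
seeded, patient_period = inject_patient(
    history, current, attack, seed_fraction=0.25, rng=random.Random(1)
)
print(len(impetuous_period), sum(len(p) for p in seeded))
```

Because the seeded files sit inside the user's own history, distance-based scores for the later injection shrink, which is why the patient attacker is the harder case for a self-history check.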