HENGHA: DATA HARVESTING DETECTION ON HIDDEN DATABASES
Shiyuan Wang, Divyakant Agrawal, Amr El Abbadi
University of California, Santa Barbara
CCSW 2010
10/8/2010

Data Security Concern: Back-End Databases of Web-based Applications
• Form-based query interfaces give both users and attackers an entry point to the back-end database.
• Traditional Attacks
  • Submit malicious requests to break into the hidden database through vulnerabilities in the application, e.g. SQL injection [Vale05].
  • Many can be detected by prior work.

Data Security Concern: Back-End Databases of Web-based Applications (continued)
• Data Harvesting Attacks
  • Iteratively submit legitimate queries to extract the data inventory or infer sensitive aggregate information.
  • Example 1: A competitor of car rental company A harvested A's inventory of a popular car.
  • Example 2: Terrorists inferred that a flight was relatively empty and could be a hijacking target.

Anatomy of Data Harvesting Attacks
• General strategy
  • Iteratively submit legitimate queries with valid fields, analyze the results, and then design new queries, with the goal of maximizing information gain within a limited number of queries.
• Two types of harvesting attacks to consider
  • Crawling Attack
    • Performed by deep web crawling [Madh08]
  • Sampling Attack
    • Performed by uniform random sampling on results of size no more than K [Dasg09]

How To Defend Against Data Harvesting Attacks
• Database inference control [Denn83]?
  • Query set restriction is not effective, especially against sampling attacks.
  • Query set restriction and data perturbation [Dasg09] hurt usability.
• Web robot detection [Tan02]?
  • Data harvesters can mimic normal users' HTTP traffic patterns.

Our Approach
• Detection based on search behaviors within sessions
• Attackers' search behaviors
  • Diversity
    • Queries are not concentrated or localized, and they reflect very distinct intents.
  • Broadness
    • The results of the queries cover a broad scope of the underlying data.

HengHa: Detecting Data Harvesting Attacks at Single Session Level
• Identify data harvesting attackers by examining whether their search behaviors in a session show relatively significant diversity and broadness.
  • Diversity -> query correlation
  • Broadness -> result coverage
[Figure: the HengHa detector sits between the Web application and the DB. Heng (query correlation observer) watches queries, Ha (result coverage monitor) watches results, and suspicious sessions are flagged.]

Heng: Query Correlation Observer
• Key idea
  • Frequent predicate value sets serve as indications of correlations among queries.
[Table: Queries in a Session That Plans Trip to Chicago]
• Intuitively, if a session has more frequent predicate value sets with higher supports, and those predicate value sets are more similar to the queries, then the queries in this session are more correlated.

Ha: Result Coverage Monitor
• Key idea
  • Sort the multi-attribute data D in a total order that preserves locality, e.g. a z-curve.
  • Create a coverage bit vector (CBV) whose bits correspond to the data in that total order.
  • Accessing a data record sets the corresponding bit.
• Training
  • Cluster CBVs to model different data access patterns.
[Figure: 2x2 grid over x and y with cells numbered 0 to 3 in z-curve order, alongside example CBVs 0000, 0100, 1100, 1110]

Experiment
• Extracted 98,564 real user query sessions and a data table of 387 records from the KDD Cup 2000 clickstream dataset
• Synthesized 1000 attack sessions [Madh08, Dasg09]
• Ran on a server with an Intel 2.4 GHz CPU, 3 GB RAM, and FC 8 OS
• Performed four-fold cross-validation
[Figures: Effectiveness of Detection in Four Validations; Efficiency of Detection in Four Validations]

Conclusion & Future Work
• Identified non-traditional data harvesting attacks on the back-end databases of web-based applications, i.e. the crawling attack and the sampling attack.
• Detection is based on identifying attackers' distinctive search behaviors at the single session level: diversity -> query correlation observer, broadness -> result coverage monitor.
• Detecting cross-session data harvesting attacks is left for future work.
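
The two key ideas behind Heng and Ha can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the (attribute, value) representation of a query, the two-attribute integer records, and the brute-force subset counting are all assumptions made for illustration. The slides do not give the correlation score or the clustering step, so neither is shown.

```python
from collections import Counter
from itertools import combinations


def frequent_predicate_sets(queries, min_support, max_size=2):
    """Heng idea (hypothetical helper): return predicate value sets that
    appear in at least min_support queries of one session.

    Each query is assumed to be a dict mapping attribute -> value.
    """
    counts = Counter()
    for query in queries:
        predicates = sorted(query.items())
        # Brute-force enumeration of small subsets; fine for a sketch.
        for size in range(1, max_size + 1):
            for subset in combinations(predicates, size):
                counts[subset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def z_value(x, y, bits=8):
    """Interleave the bits of x and y into a single z-curve key.

    Putting x in the higher bit of each pair is an assumption.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z


def coverage_bit_vector(records, accessed, bits=8):
    """Ha idea (hypothetical helper): one bit per record, laid out in
    z-curve order, set to 1 if the session accessed that record."""
    ordered = sorted(records, key=lambda r: z_value(r[0], r[1], bits))
    hit = set(accessed)
    return [1 if r in hit else 0 for r in ordered]


if __name__ == "__main__":
    # Hypothetical session: three queries that share one predicate value.
    session = [
        {"dest": "Chicago", "type": "hotel"},
        {"dest": "Chicago", "type": "flight"},
        {"dest": "Chicago", "type": "car"},
    ]
    print(frequent_predicate_sets(session, min_support=2))
    # {(('dest', 'Chicago'),): 3}

    grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(coverage_bit_vector(grid, accessed=[(0, 1), (0, 0)]))
    # [1, 1, 0, 0]
```

A session whose queries share many high-support predicate value sets looks correlated, while a session whose CBV has bits set across distant regions of the z-curve order looks broad. How these signals are scored and combined is not specified in the slides.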

References
• [Vale05] F. Valeur et al. A learning-based approach to the detection of SQL attacks. In DIMVA, pages 123–140, 2005.
• [Dasg09] A. Dasgupta et al. Privacy preservation of aggregates in hidden databases: why and how? In SIGMOD, pages 153–164, 2009.
• [Madh08] J. Madhavan et al. Google's deep web crawl. PVLDB, 1(2):1241–1252, 2008.
• [Tan02] P.-N. Tan et al. Discovery of web robot sessions based on their navigational patterns. Data Min. Knowl. Discov., 6(1):9–35, 2002.
• [Denn83] D. E. Denning et al. Inference controls for statistical databases. Computer, 16(7):69–82, 1983.

Thanks for Listening