DATA SECURITY AND BIG DATA
Carole Murphy
November 20, 2013

Big Data Conferences
• Major conferences are opportunities to learn, meet colleagues, and see vendor demos

Executive Summary: Five Things You Need to Know About Big Data Security
1. Time-to-insight is an even more important business driver for Hadoop than cost savings
2. Unless you take action, security is likely to be addressed late, and then it applies the brakes to the Big Data project
3. Data security in the Hadoop ecosystem is about much more than authorization and authentication
4. Traditional data security solutions protect data at rest, but not in use or in motion; the best solutions retain data value even as they remove security and compliance obstacles to the project
5. Big Data presents an opportunity to address security and compliance across your IT environment; look for adaptable and extensible security solutions

Big Data IS Now!
• Biggest growth drivers
  – Accelerating enterprise adoption
  – Maturing software
  – Increasingly sophisticated professional services
  – Continued investment
  – Transformation of the data center
• "By 2017, Big Data will be the norm for data management..."*
*Forrester, The Top Emerging Technologies To Watch: Now Through 2018, by Brian Hopkins and Frank E. Gillett, February 7, 2013

Background: Big Data – What's Different?
• Data comes from many sources and doesn't need a schema
  – The traditional path is ETL to data warehouse to BI; the Hadoop path is raw load to Hadoop to BI
  – Dump raw loads of data into Hadoop (see the schema-on-read sketch below)
• Hadoop processing is fast
  – Compute in minutes what would take a night to batch process
• BI is real-time
  – Ask questions you didn't know you needed to ask
• Elephant in the room
  – A data "lake" is many times cheaper than the data warehouse path

ETL Offload Use Case*
[Diagram: Hadoop (HDFS, Map Reduce, Pig) sits between the data sources and BI, taking over the transformation work]
*Presented by MapR at Hadoop Summit, San Jose, June 2013
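To make the raw-load contrast concrete, here is a minimal schema-on-read sketch in Python (in-memory records standing in for files in HDFS; the field names and values are invented for illustration):

```python
import json

# Raw load: events land as-is; no schema is enforced at write time.
raw_events = [
    '{"user": "u1", "amount": 19.99, "ts": "2013-11-20T10:00:00"}',
    '{"user": "u2", "amount": 5.00}',                    # missing field: fine
    '{"user": "u3", "amount": 42.10, "region": "EU"}',   # extra field: fine
]

# Schema-on-read: structure is imposed only when a question is asked.
def read_with_schema(lines):
    for line in lines:
        rec = json.loads(line)
        yield rec.get("user"), float(rec.get("amount", 0.0))

print("total spend:", sum(amount for _, amount in read_with_schema(raw_events)))
```

Because the raw records were never forced through a fixed ETL schema, new questions can be asked of old data at any time.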
Taming the Explosion in Data: Optimizing Time-to-Insight
[Chart: global data created and consumed in exabytes per month, 2000–2015, showing parabolic growth – Cisco]
"90% of the data in the world today has been created in the last two years alone" – IBM
• The explosion in data fuels growth and agility
• But time to data value is gated by risk and compliance
• Attacks on data are here to stay, and big data means a big target
• Balancing data access and data security is critical

Risk Increases as Data Moves to Cloud and Big Data Environments
• Risk rises along the path from individual apps, mainframes, and OLTP systems to the data warehouse (Oracle, Teradata, Netezza, etc.), then Hadoop, then the cloud
• Hadoop was not created for the enterprise; security is just starting to be bolted on
• In the cloud: who has control of your data?

Extracting Value from Data: Big Data Includes Sensitive Data
• Marketing – analyze purchase patterns
• Social media – find the best customer segments
• Financial systems – model trading data
• Banking and insurance – 360° customer view
• Security – identify credit card fraud
• Healthcare – advance disease prevention
How do you liberate the value in data – without increasing risk?

Why Projects Get Stopped: Hidden Risks in Big Data Adoption
Big Data enables deeper data analysis and more value from old data – and new risks if data is not protected.
• Data concentration risks
  – Internal users
  – External shares
  – Backups, Hadoop stores, data feeds
• Breach risks
  – Financial position
  – Market position
  – Corporate compliance risk
• Data sharing risks
  – Compliance challenges with 3rd-party risk
  – Cross-border data residency
  – Data in and out of the enterprise
• Cloud adoption risks
  – Sensitive data in untrusted systems
  – Data in storage, in use, and in transit to the cloud

Take Advantage of Big Data Benefits: Identifying an Effective Data Security Strategy
• Integrate security, enable access
  – Protect sensitive data before it enters Hadoop, inside Hadoop, and on the way out
  – Enable accurate analytics on encrypted data
• Assure compliance
  – Address global compliance comprehensively
  – Reduce PCI audit scope to cut costs
  – Use provable, verified, published, peer-reviewed, NIST-recognized security techniques
• Optimize performance and extensibility
  – High performance
  – Adapt to the newest tools in the ecosystem
  – Fit into existing infrastructure; fast and easy to implement

Options for Security: Hadoop Community
• SSL
  – Disabled by default; doesn't cover all paths; adds latency and CPU load
• Existing Hadoop access controls
  – Kerberos is still the primary way to secure a Hadoop cluster
  – Not fine-grained; can't limit access by data type or column
  – Inappropriate access remains possible post-analysis
• Sentry from Cloudera
  – Offers permission controls for data accessed through Hive
• Knox from Hortonworks
  – Gateway server provides a single point of authentication and access for Hadoop services in a cluster
• MapR native authentication and authorization
  – Transparent integration with Kerberos, or the option of native authentication

Options for Security: Commercial Data Security Products
• Container-based encryption
  – Data-at-rest security at the block or file level
  – Do you want different people and applications to have access to different data types?
• Traditional data masking
  – One-way only, which limits use cases (e.g. fraud analysis)
  – The technique doesn't support production use cases
• Application-level protection
  – Encryption and tokenization options (see the tokenization sketch below)
  – Consider standards-based approaches and key management
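As a concrete picture of field-level, format-preserving protection, here is a minimal Python sketch. It is not Voltage FPE or any product's API; the HMAC construction and key are assumptions chosen only to show the shape of the idea:

```python
import hmac, hashlib

# A minimal, vault-free sketch of deterministic, format-preserving
# tokenization (NOT a specific product's algorithm; the key and
# construction are illustrative only).
KEY = b"demo-key-not-for-production"  # real systems use managed keys

def tokenize_pan(pan: str) -> str:
    """Replace the first 12 digits of a 16-digit card number with a
    deterministic surrogate, keeping the format and the last four."""
    digits = pan.replace(" ", "").replace("-", "")
    if len(digits) != 16 or not digits.isdigit():
        raise ValueError("expected a 16-digit PAN")
    mac = hmac.new(KEY, digits.encode(), hashlib.sha256).digest()
    surrogate = int.from_bytes(mac[:8], "big") % 10**12
    return f"{surrogate:012d}{digits[-4:]}"

token = tokenize_pan("4111 1111 1111 1234")
print(token)                                         # 16 digits, ends in 1234
print(tokenize_pan("4111-1111-1111-1234") == token)  # True: deterministic
```

Because the token is the same every time and keeps the original format, downstream schemas, joins, and group-bys are unaffected. Unlike this one-way sketch, reversible approaches (a token vault or standards-based FPE) additionally allow authorized detokenization.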
Goals
• All sensitive data must be stored on disk in protected form (encrypted or tokenized)
  – Compliance requirements (PCI, HIPAA)
  – Disks are often removed from the data center for servicing
  – There are many ways data can flow into HDFS, such as unstructured data being copied in directly
• Sensitive data should also be protected during analysis, because Hadoop has insufficient access controls
• Provide access controls based on data type and project (data set)

Solutions for Handling Structured and Unstructured Data
• Disk or volume-level (whole-file) encryption
  – Enables compliance
  – Covers unstructured data from all sources
  – Provides protection against drive loss
• Good, but may not be sufficient
  – Doesn't reduce audit scope for PCI DSS
  – Access controls in Hadoop can't control user access at the field level, so access to the cluster may need to be restricted to pass a PCI or HIPAA audit
• Field-level tokenization and/or encryption
  – Enables wider use of the cluster by multiple teams
  – Allows data sharing while certain fields remain protected
  – Protects against failures at multiple layers
  – Required for regulatory compliance in many cases

All Hadoop Integration Options
[Architecture diagram: data sources flow through a landing zone and ETL into HDFS via batch loads, Sqoop, and Flume; Map Reduce, Hive, storage encryption, and more run inside the cluster; Sqoop and Hive feed the data warehouse and BI applications; key management, tokenization, and policy control span the entire flow]

The same architecture supports protection and retrieval at every stage:
• Protecting data inbound to Hadoop – before ingestion, during ingestion, or after ingestion
• Retrieving clear data from Hadoop – before, during, or after export/query
• PCI data – keep Hadoop and the data warehouse out of audit scope
• PHI data – encrypted in Hadoop for HIPAA, with minimized application changes
• Private application data – a critical part of compliance, 100% transparent

Use Case: Healthcare Company
• Challenge
  – Big Data team tasked with securing a large multi-node Hadoop cluster for HIPAA and HITECH
  – Challenging time frames
• Solution
  – Data de-identified in the ETL move, before entering Hadoop
  – Ability to decrypt analytic results when needed, through multiple tools
• Benefits
  – Ability to leverage medical data to develop more targeted marketing strategies and services for key demographics

Use Case: Multi-national Bank
• Challenge
  – PCI compliance is the #1 driver
  – ETL offload use case, with Hadoop alongside a traditional data warehouse
• Solution
  – Integrate with Sqoop on ingestion and Hive on the application/query side to protect dozens of data types
  – Fraud analysts work with tokenized credit card numbers
• Benefits
  – Fraud analytics run directly on protected data in Hadoop (sketched below)
  – Fraud analysts can de-tokenize as needed, under strict controls
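A minimal sketch of the bank's pattern, with invented tokens, roles, and an in-memory vault standing in for real key management, tokenization, and policy control:

```python
from collections import Counter

# Illustrative values only: the tokens, vault contents, and roles are
# invented; the dict stands in for a real key-management/token service.
VAULT = {"8452031174901234": "4111111111111234"}
POLICY = {"fraud_analyst": False, "fraud_investigator": True}

# Transactions carry tokenized card numbers end to end.
txns = [
    ("8452031174901234", 120.00),
    ("8452031174901234", 118.50),
    ("8452031174901234", 121.75),
]

# Fraud analytics run directly on protected data: deterministic tokens
# preserve equality, so repeated use of one card is visible without
# ever exposing the clear number.
hits = Counter(token for token, _ in txns)
flagged = [t for t, n in hits.items() if n >= 3]
print("flagged tokens:", flagged)

def detokenize(token: str, role: str) -> str:
    """Return the clear value only to roles the policy allows."""
    if not POLICY.get(role, False):
        raise PermissionError(f"role {role!r} may not detokenize")
    return VAULT[token]

print(detokenize(flagged[0], "fraud_investigator"))   # clear PAN
# detokenize(flagged[0], "fraud_analyst") would raise PermissionError
```

The design point is that the analytic path never needs the clear data; only the narrow investigation path crosses the policy gate.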
Use Case: U.S. Military Organization
• Challenge
  – U.S. Surgeon General directive: share healthcare data with medical research institutes
  – Maintain HIPAA/HITECH compliance
• Solution
  – De-identified a 100+ TB dataset at the field level before release
  – Format-preserving encryption enables distributed analytics in Hadoop
  – Usable data values for accurate analytics
• Benefits
  – Secure re-identification by the agency as needed
  – Improved healthcare, with compliance

Key Considerations
• Most Big Data projects are associated with data warehouse projects. What is your data warehouse strategy (e.g. expansion, ETL offload to Hadoop, integrating new data sources)?
• What are your use cases? What does the business need?
• If you use de-identified data in Hadoop, would you ever need to get back to the original data?
• Will you have sensitive data going into Hadoop (PII, PCI, PHI)?
• What compliance or privacy regulations are you concerned about addressing?
• Do you need data protection across disparate systems (open systems to mainframe)?

Security Checklist to Make Big Data Safe
• Solves complex global compliance issues
• Ensures data stays protected wherever it goes
• Enables accurate analytics on encrypted data
• Optimizes performance
• Flexibly adapts to the fast-growing Hadoop ecosystem
• Reduces PCI audit scope where applicable

About Voltage Security
• Origins: DARPA-funded research at Stanford University
• Patented innovations: 27
  – Unstructured data: Identity-Based Encryption (IBE)
  – Structured data: Format-Preserving Encryption (FPE), tokenization, data masking, stateless key management
• Leader in large-scale data-centric security solutions
• Customers: 1200+ enterprise customers and government agencies
• Analyst recognition: Gartner, Forrester, Burton IT1, Mercator
• Contact Voltage Security: www.voltage.com

Copyright 2013 Voltage Security

THANK YOU