“Big Data”: A New Security Paradigm
Bill Lisse, CISSP, CISA, CIPP, PMP, G2700

Introduction
• Global ISO for OCLC Online Computer Library Center, Inc.
• Over 25 years of security, audit, and investigative experience
• Both U.S. Government and commercial organizations
• Financial institutions, manufacturing and distribution, healthcare

OCLC WorldCat®
• 72,000 - Number of libraries represented worldwide
• 1,801,677,890 - Number of holdings
• 170 - Countries and territories with library holdings in WorldCat
• Every 10 seconds - How often a record is added
• Over 470 - Number of languages and dialects represented
• Every 4 seconds - How often a request is filled through WorldCat Resource Sharing
• 256,514,231 - Number of bibliographic records

What Is “Big Data”?
• A logical outgrowth of the increased use of virtualization technology, cloud computing, and data center consolidation
• NoSQL, defined as non-relational, distributed, and horizontally scalable data stores (www.nosql-database.org)
• Abandons the constraints of schemas and transactional consistency in favor of simple usage and massive scalability in terms of data storage and processing capabilities
• A technology that can handle big data, storing more and being able to analyze the aggregate, at a scale beyond the reach of relational databases (RDBMS)
• The NoSQL alternative to RDBMS ACID is sometimes described as BASE (Basically Available, Soft state, Eventual consistency)

New Security Paradigm
• Most of us don't have tools or processes designed to accommodate nonlinear data growth
• Traditional security tools may no longer provide value

What Were Your Tools and Processes Designed to Do?
• How difficult is it to run a malware scan across a large NAS or SAN volume? Would it be feasible to scan through it all every day, as we do now, with 100K more?
• If data discovery is required to support data leak prevention (DLP) or regulatory compliance, what are the implications?
• Some scenarios where data size could be a factor in the proper operation of a security control:
  • Log parsing
  • File monitoring
  • Encryption/decryption of stored data
  • File-based data integrity validation controls (a minimal sketch of such a control follows below)
• From a security standpoint, we are all starting from scratch
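To make the scale problem concrete, the snippet below is a minimal sketch of the kind of file-based integrity validation control referenced above: it walks a volume and hashes every regular file with SHA-256. The mount point, class name, and single-threaded design are illustrative assumptions rather than anything from the original material; the point is that a full daily pass of this kind stops being feasible once data growth is nonlinear.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.stream.Stream;

// Minimal sketch of a file-based integrity validation control:
// walk a volume and compute a SHA-256 digest for every regular file.
// The mount point below is hypothetical; on a multi-petabyte NAS/SAN
// volume a sequential sweep like this quickly stops being feasible.
public class IntegritySweep {
    public static void main(String[] args) throws Exception {
        Path root = Paths.get(args.length > 0 ? args[0] : "/mnt/nas-volume");
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile).forEach(IntegritySweep::digest);
        }
    }

    private static void digest(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] buf = new byte[8192];
            int read;
            while ((read = in.read(buf)) != -1) {
                sha256.update(buf, 0, read);
            }
            // A real control would compare the digest against a known-good baseline.
            System.out.printf("%s  %s%n", toHex(sha256.digest()), file);
        } catch (Exception e) {
            System.err.println("Skipping " + file + ": " + e.getMessage());
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```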
Targets and Threats
• "Eggs in one basket": centralized data is a lucrative target
• Growth in research and hacker activity targeting NoSQL databases (www.TeamShatter.com)
• Oozie (proxy) is a superuser capable of performing any operation as any user
• Hadoop Distributed File System (HDFS) proxies are authenticated by IP address; stealing the IP of a proxy could allow an attacker to extract large amounts of data quickly
• Posing as a Name Node or Data Node by obtaining the shared "secret key" can give access to all of the data stored in HDFS
• Data may be transmitted over insecure transports, including HSFTP, FTP, and HTTP; HDFS proxies use the HSFTP protocol for bulk data transfers
• Tokens: must get them all - Kerberos Ticket Granting Token, Delegation Token, Shared Keys (if possible), Job Token, Block Access Token

Authentication: Actors in Hadoop Security
• User (accesses HDFS and Map-Reduce services)
• HDFS and Map-Reduce services (service user requests and coordinate among themselves to perform Hadoop cluster operations)
• Proxy service such as Oozie (accesses Hadoop services on behalf of the user)

Vulnerabilities
• Almost every NoSQL developer is in learning mode; there are over a hundred different NoSQL variants
• HDFS does not provide high availability, because an HDFS file system instance requires one unique server, the Name Node (a single point of failure)
• Environments can include data of mixed classifications and security sensitivities
• Aggregating data from multiple sources can cause access control and data entitlement problems
• Aggregating data into one environment also increases the risk of data theft and accidental disclosure
• There is no such thing as vulnerability assessment or database activity monitoring for NoSQL
• Label security is based on schema, which does not exist in NoSQL; no object-level security (collection, column)

Vulnerabilities (Cont.)
• Encryption can be problematic: data and indices need to be in clear text for analysis, requiring application designers to augment security with masking, tokenization, and selective use of encryption in the application layer
• DoS attacks
• NoSQL application vulnerabilities:
  • Connection pollution
  • JSON injection
  • Key brute force
  • HTTP/REST-based attacks
  • Server-side JavaScript, integral to many NoSQL databases
  • REST APIs and CSRF
  • NoSQL injection (a sketch appears before the Conclusion below)

Architecture and Design Considerations
• Define your use cases: security requirements are derived from core business and data requirements; assess whether NoSQL is still a valid solution
• Based on security requirements, decide whether you should host your database(s) in your own data center or in the cloud
• Categorize use cases to see where NoSQL is a good solution and where it is not
• Define a data security strategy and standards
• Data classification is imperative
• How do we prevent bad data from getting into the NoSQL data store?
• Built-in HDFS security features such as ACLs and Kerberos, used alone, are not adequate for enterprise needs
• Software running behind a firewall with inadequate security?
• Authentication
• Role-Based Access Control (RBAC)
• Support for AUTHN (authentication) and AUTHZ (authorization)
• Some federated identity systems are implemented with SAML, with environment security measures embedded in the cloud infrastructure
• ACLs for transactional as well as batch processes

Architecture and Design Considerations (cont.)
• Defense in depth
• The security features in the Cloudera Distribution with Hadoop 3 meet the needs of most customers, because the cluster is typically accessible only to trusted personnel
• Hadoop's current threat model assumes that users cannot:
  • Have root access to cluster machines
  • Have root access to shared client machines
  • Read or modify packets on the network of the cluster
• Middle tier: acts as a broker when interacting with the Hadoop servers (Apache Hive, Oozie, etc.)
• RPC connection security: SASL GSSAPI
• HDFS: permissions model
• Job control: ACL-based; includes a View ACL
• Web interfaces: out-of-the-box Kerberos/SSL support
• HDFS and MapReduce modules should have their own users
• NoSQL database servers behind a firewall and proxy

Architecture and Design Considerations (cont.)
• A separate persistence layer to apply authentication and ACLs in a standard, centralized fashion
• Batch jobs and other utility scripts that access the database outside the applications should be controlled
• Logging
  • Audit trails are whatever the application developer built in, so they are both application-specific and limited in scope
  • What data needs to be logged for security analytics purposes?
  • What should the log format be for business vs. security logs?
  • Do we need to store the security logs in a different file (a new log4j appender) so that only authorized users (admins) have access to them? (See the sketch below.)
  • How would the logs work with a SIEM tool (if applicable)?

Architecture and Design Considerations (cont.)
• If necessary, put NoSQL-stored data into separate "enclaves" to ensure that it can be accessed only by authorized personnel
• The security infrastructure for Hadoop RPC uses the Java SASL APIs
• Quality of Protection (QOP) settings can be used to enable encryption for the Hadoop RPC protocols (as sketched below)
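As a concrete illustration of the preceding bullets, the sketch below shows where Kerberos authentication, the hadoop.rpc.protection (QOP) setting, and the HDFS permissions model come together for a client written against the Hadoop Java API. The principal name, keytab location, group, and HDFS path are assumptions made for the example; this is a minimal sketch, not a hardened configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.security.UserGroupInformation;

// Hypothetical principal, keytab, and paths; shown only to illustrate where
// Kerberos authentication, the RPC quality-of-protection setting, and the
// HDFS permissions model are wired together.
public class SecureHdfsClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Turn on Kerberos authentication and service-level authorization.
        conf.set("hadoop.security.authentication", "kerberos");
        conf.set("hadoop.security.authorization", "true");
        // QOP: "privacy" asks SASL to encrypt Hadoop RPC traffic
        // ("authentication" and "integrity" are the weaker options).
        conf.set("hadoop.rpc.protection", "privacy");

        // Log in from a keytab rather than relying on an interactive ticket cache.
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
                "etl-batch@EXAMPLE.COM", "/etc/security/keytabs/etl-batch.keytab");

        // Apply the HDFS permissions model to a sensitive directory:
        // owner full access, group read/execute, no access for others.
        // (Changing ownership requires superuser or owner privileges.)
        FileSystem fs = FileSystem.get(conf);
        Path sensitive = new Path("/data/claims/phi");
        fs.mkdirs(sensitive);
        fs.setOwner(sensitive, "etl-batch", "claims-analysts");
        fs.setPermission(sensitive, new FsPermission((short) 0750));
    }
}
```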
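One possible answer to the logging questions above is a dedicated security logger routed to its own appender, so security events land in a separate file with restricted access that a SIEM collector can consume on its own. The sketch below assumes log4j 1.x on the classpath; the logger name, file path, and event fields are hypothetical.

```java
import org.apache.log4j.Logger;

// Sketch of separating security events from business logging with a dedicated
// logger name. A log4j appender bound to "SecurityAudit" (for example, in
// log4j.properties:
//   log4j.logger.SecurityAudit=INFO, securityFile
//   log4j.additivity.SecurityAudit=false
//   log4j.appender.securityFile=org.apache.log4j.RollingFileAppender
//   log4j.appender.securityFile.File=/var/log/app/security-audit.log
//   log4j.appender.securityFile.layout=org.apache.log4j.PatternLayout
// ) writes these entries to a separate file whose OS permissions can be
// limited to administrators.
public class AuditTrail {
    private static final Logger SECURITY = Logger.getLogger("SecurityAudit");
    private static final Logger BUSINESS = Logger.getLogger(AuditTrail.class);

    public static void recordAccess(String user, String collection, String action) {
        // Key=value pairs keep the format easy for a SIEM to parse.
        SECURITY.info(String.format("event=data_access user=%s collection=%s action=%s",
                user, collection, action));
    }

    public static void main(String[] args) {
        BUSINESS.info("nightly aggregation started");           // business log
        recordAccess("jdoe", "patient_records", "bulk_export");  // security log
    }
}
```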
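Returning to the NoSQL injection item in the application-vulnerabilities list above, the driver-agnostic sketch below shows how untrusted input concatenated into a query document can smuggle in an operator or server-side JavaScript clause, and one way to constrain the input. The field name, whitelist pattern, and string-based query construction are illustrative assumptions; real drivers provide document or parameter builders that should be preferred over hand-built strings.

```java
// Driver-agnostic illustration of the injection risk: the query documents are
// plain strings here, and the field and collection semantics are made up.
public class QueryBuilding {

    // UNSAFE: untrusted input is pasted straight into a JSON query document.
    // Input that breaks out of the string literal can append an operator or a
    // server-side JavaScript clause and change the query's meaning.
    static String unsafeQuery(String userName) {
        return "{ \"user\": \"" + userName + "\" }";
    }

    // SAFER: treat the input strictly as a string value - reject anything that
    // could break out of the literal, and let the driver's own document or
    // parameter API do the final encoding when one is available.
    static String saferQuery(String userName) {
        if (!userName.matches("[A-Za-z0-9_.-]{1,64}")) {
            throw new IllegalArgumentException("unexpected characters in user name");
        }
        return "{ \"user\": \"" + userName + "\" }";
    }

    public static void main(String[] args) {
        // Prints: { "user": "jdoe", "$where": "1 == 1" }
        System.out.println(unsafeQuery("jdoe\", \"$where\": \"1 == 1"));
        System.out.println(saferQuery("jdoe"));
    }
}
```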
Conclusion
“In nutshell, Hadoop has strong support for authentication and authorization. On the other hand privacy and data integrity is optionally supported when Hadoop services are accessed through RPC and HTTP, while the actual HDFS blocks are transferred unencrypted. Hadoop assumes network involved in HDFS block transfer is secure and not publicly accessible for sniffing, which is not a bad assumption for private enterprise network.”
- Nitin Jain, http://clustermania.blogspot.com/2011/11/hadoop-how-it-manages-security.html

Questions? More Information?
Bill Lisse
Bill.lisse@gmail.com