Gopi Ramamoorthy, CISSP, CISA, CISM

Agenda
- Bigdata: Quick Overview
- Bigdata Eco System: Quick Overview
- Bigdata Security: Current Options
- Bigdata Security: An Efficient Way

Introduction
What is this presentation about?
Securing big data using available technologies and workflow design without impacting performance.

What is Bigdata?
Big data is defined as data sets that are too large and complex to manipulate or interrogate with standard methods or tools. Its characteristics are often summarized as the "Vs":
- Volume
- Velocity
- Variety
- Volatile nature

Problem Overview
- Feeds into HDFS come from many different sources.
- The Hadoop eco system does not provide built-in security and vault features comparable to those provided by RDBMS database systems.
- Many components in the eco system do not address security, directly or indirectly.
- Encrypting and decrypting huge amounts of data slows performance and can be heavily resource consuming.

This presentation discusses building or changing infrastructure to resolve the above problems without impacting performance and response time.
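One way to avoid the cost of encrypting entire data sets, in the spirit of the "efficient way" this deck proposes, is to protect only the sensitive fields before records land in HDFS. The sketch below is a minimal illustration, not a production design: the key handling, field names, and HMAC-based tokenization scheme are all assumptions for the example (a real deployment would fetch keys from a key manager or HSM, as discussed later in the deck).

```python
import hmac
import hashlib

# Hypothetical secret; in practice this comes from a key manager / HSM.
SECRET_KEY = b"demo-key-not-for-production"

# Illustrative set of fields treated as sensitive.
SENSITIVE_FIELDS = {"ssn", "dob"}

def tokenize(value: str) -> str:
    """Replace a sensitive value with a deterministic, non-reversible token.

    Deterministic tokens keep joins and aggregations working on the
    tokenized column without exposing the raw value.
    """
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def protect_record(record: dict) -> dict:
    """Tokenize only the sensitive fields before the record lands in HDFS."""
    return {k: tokenize(v) if k in SENSITIVE_FIELDS else v
            for k, v in record.items()}

record = {"name": "Jane Doe", "ssn": "123-45-6789", "balance": "1000"}
safe = protect_record(record)
assert safe["ssn"] != "123-45-6789"   # sensitive field is masked
assert safe["balance"] == "1000"      # bulk data stays untouched
```

Because only a small fraction of each record is transformed, the bulk of the data avoids encryption overhead entirely, which is the performance argument behind masking and tokenization.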
Units used to measure Big Data Size

  Prefix  10^n                  Symbol  Example Data Channel
  Giga    10^9                  G
  Tera    10^12                 T       Common with RDBMS databases
  Peta    10^15 (1,000 TB)      P       User data created on an online site in a couple of hours
  Exa     10^18 (1 million TB)  E       Data created on the internet every day
  Zetta   10^21                 Z
  Yotta   10^24                 Y

Hadoop Eco System

  Category                        Tool / Framework
  Getting Data Into HDFS          Flume, Sqoop, Scribe, Chukwa, Kafka
  Compute Frameworks              MapReduce, YARN, Weave, Cloudera SDK
  Querying Data                   Pig, Hive, Impala, Java MapReduce, Hadoop Streaming, Cascading Lingual, Stinger/Tez, Hadapt, Greenplum HAWQ, Cloudera Search, Presto
  NoSQL Stores                    HBase, Cassandra, Redis, Amazon SimpleDB, Voldemort, Accumulo
  Hadoop in the Cloud             Amazon EMR, Hadoop on Rackspace, Hadoop on Google Cloud
  Workflow Tools & Schedulers     Oozie, Azkaban, Cascading, Scalding, Lipstick
  Serialization Frameworks        Avro, Trevni, Protobuf, Parquet
  Monitoring Systems              Hue, Ganglia, OpenTSDB, Nagios
  Applications / Platforms        Mahout, Giraph, Lily
  Distributed Coordination        ZooKeeper, BookKeeper
  Distributed Message Processing  Kafka, Akka, RabbitMQ
  BI                              Datameer, Tableau, Pentaho, SiSense, SumoLogic
  YARN-Based Frameworks           Samza, Spark, Malhar, Giraph, Storm, Hoya
  Libraries & Frameworks          Kiji, Elephant Bird, Summingbird, Apache Crunch, Apache DataFu, Continuuity
  Data Management                 Apache Falcon
  Security                        Apache Sentry, Apache Knox
  Testing Frameworks              MRUnit, PigUnit
  Miscellaneous                   Spark, Shark

Hadoop Core Components
Core: A set of shared libraries
HDFS: The Hadoop filesystem
MapReduce: Parallel computation framework
Flume: Collection and import of log and event data
Sqoop: Imports data from relational databases
ZooKeeper: Configuration management and coordination
HBase: Column-oriented database on HDFS
Hive: Data warehouse on HDFS with SQL-like access
Pig: Higher-level programming language for Hadoop computations
Oozie: Orchestration and workflow management
Impala: Real-time querying tool
Mahout: A library of machine learning and data mining algorithms

Basic Security
- Network separation
- Authentication
- Permissions
- Authorization management solution
- Encryption

Efficient Security
- Data categorization
- Data masking
- Tokenization
- Do not send sensitive data to HDFS if it is not required
- Use workflow: separate sensitive data into another cluster
- Monitor the Hadoop eco system; deploy SIEM-model monitoring

Bigdata: Security Based on Data and Work Flow
- Identify channels and data sources
- Identify data content
- Introduce or extend data classification to big data
- Identify the workflow
- Select access methods
- Select encryption methods
- Select the analytics tool
- Define the archive policy
- Define the purge and retention policy

Must-Have Features for Security Modules/Architecture
- Key manager
- No impact to performance
- HSM integration and support
- Compliance support
- Easy to administer and migrate

Data Categorization
Data categorization is a well-known concept used to implement different levels of security based on the data. For big data, categorization needs to be extended across the complete data flow, from entry to end (purge). Implement multiple big data clusters based on data category.

Data Classification
- Super Sensitive: DOB, SSN, IP, design
- Sensitive: Account, address, balance, etc.
- Confidential / Private: Company business information, vendor information
- Public: News releases, public finance data

Bigdata: Data LifeCycle
[Diagram: data from each source channel (Channels 1-8) enters the cluster through a per-channel encryption method (e.g., e4) and access method (e.g., a4); each channel's data is then analyzed and governed by its own purge/retention policy (e.g., pr4) and archive policy (e.g., ar4).]

References and Acknowledgments
- Cloudera
- Project Rhino by Intel (open source)
- ZettaSet
- Apache projects
- Hadoop Illuminated
- IBM
- Yahoo
- Oracle
- Hortonworks
- And many more

Questions
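The per-channel lifecycle described in the deck (each channel carrying its own encryption method, access method, purge/retention policy, and archive policy) can be sketched as a small policy registry. Everything below is a hypothetical illustration: the channel names, policy fields, and values are assumptions for the example, not part of the original material.

```python
from dataclasses import dataclass

@dataclass
class ChannelPolicy:
    """Hypothetical per-channel security/lifecycle policy record."""
    encryption: str      # encryption method applied on ingest
    access: str          # access method required of readers
    retention_days: int  # purge/retention window
    archive: str         # archive destination/policy

# Illustrative registry keyed by data source channel.
POLICIES = {
    "channel-1": ChannelPolicy("aes-256", "kerberos+sentry", 365, "cold-hdfs"),
    "channel-2": ChannelPolicy("tokenize", "knox-gateway", 90, "none"),
}

def policy_for(channel: str) -> ChannelPolicy:
    """Look up the lifecycle policy a feed must satisfy before entering HDFS."""
    return POLICIES[channel]

p = policy_for("channel-2")
assert p.retention_days == 90
```

Keeping these decisions in one registry makes the classification-to-purge flow auditable: ingest jobs, access gateways, and retention jobs all consult the same per-channel record instead of hard-coding security choices.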