Gopi Ramamoorthy, CISSP, CISA, CISM

Agenda
- Bigdata: Quick Overview
- Bigdata Eco System: Quick Overview
- Bigdata Security: Current Options
- Bigdata Security: An Efficient Way

Introduction
What is this presentation about?
Securing big data using available technologies and workflow design without impacting performance.

What is Bigdata?
Big data is defined as data sets that are too large and complex to manipulate or interrogate with standard methods or tools. Its characteristics are often summarized as the "Vs":
- Volume
- Velocity
- Variety
- Volatile nature

Problem Overview
- Feeds into HDFS come from many different sources.
- The Hadoop eco system does not provide built-in security and vault features comparable to those provided by RDBMS database systems.
- Many components in the eco system do not address security, directly or indirectly.
- Encrypting and decrypting huge amounts of data slows performance and can be heavily resource consuming.

This presentation discusses building or changing infrastructure to resolve the above problems without impacting performance and response time.
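One way to avoid the cost of encrypting entire data sets, in the spirit of the "efficient way" this deck proposes, is to protect only the sensitive fields before records land in HDFS. The sketch below is a minimal illustration, not a production design: the key handling, field names, and HMAC-based tokenization scheme are all assumptions for the example (a real deployment would fetch keys from a key manager or HSM, as discussed later in the deck).

```python
import hmac
import hashlib

# Hypothetical secret; in practice this comes from a key manager / HSM.
SECRET_KEY = b"demo-key-not-for-production"

# Illustrative set of fields treated as sensitive.
SENSITIVE_FIELDS = {"ssn", "dob"}

def tokenize(value: str) -> str:
    """Replace a sensitive value with a deterministic, non-reversible token.

    Deterministic tokens keep joins and aggregations working on the
    tokenized column without exposing the raw value.
    """
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def protect_record(record: dict) -> dict:
    """Tokenize only the sensitive fields before the record lands in HDFS."""
    return {k: tokenize(v) if k in SENSITIVE_FIELDS else v
            for k, v in record.items()}

record = {"name": "Jane Doe", "ssn": "123-45-6789", "balance": "1000"}
safe = protect_record(record)
assert safe["ssn"] != "123-45-6789"   # sensitive field is masked
assert safe["balance"] == "1000"      # bulk data stays untouched
```

Because only a small fraction of each record is transformed, the bulk of the data avoids encryption overhead entirely, which is the performance argument behind masking and tokenization.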
Units used to measure Big Data Size

  Prefix  10^n                  Symbol  Example Data Channel
  Giga    10^9                  G
  Tera    10^12                 T       Common with RDBMS databases
  Peta    10^15 (1,000 TB)      P       User data created on an online site in a couple of hours
  Exa     10^18 (1 million TB)  E       Data created on the internet every day
  Zetta   10^21                 Z
  Yotta   10^24                 Y

Hadoop Eco System

  Category                        Tool / Framework
  Getting Data Into HDFS          Flume, Sqoop, Scribe, Chukwa, Kafka
  Compute Frameworks              MapReduce, YARN, Weave, Cloudera SDK
  Querying Data                   Pig, Hive, Impala, Java MapReduce, Hadoop Streaming, Cascading Lingual, Stinger/Tez, Hadapt, Greenplum HAWQ, Cloudera Search, Presto
  NoSQL Stores                    HBase, Cassandra, Redis, Amazon SimpleDB, Voldemort, Accumulo
  Hadoop in the Cloud             Amazon EMR, Hadoop on Rackspace, Hadoop on Google Cloud
  Workflow Tools & Schedulers     Oozie, Azkaban, Cascading, Scalding, Lipstick
  Serialization Frameworks        Avro, Trevni, Protobuf, Parquet
  Monitoring Systems              Hue, Ganglia, OpenTSDB, Nagios
  Applications / Platforms        Mahout, Giraph, Lily
  Distributed Coordination        ZooKeeper, BookKeeper
  Distributed Message Processing  Kafka, Akka, RabbitMQ
  BI                              Datameer, Tableau, Pentaho, SiSense, SumoLogic
  YARN-Based Frameworks           Samza, Spark, Malhar, Giraph, Storm, Hoya
  Libraries & Frameworks          Kiji, Elephant Bird, Summingbird, Apache Crunch, Apache DataFu, Continuuity
  Data Management                 Apache Falcon
  Security                        Apache Sentry, Apache Knox
  Testing Frameworks              MRUnit, PigUnit
  Miscellaneous                   Spark, Shark

Hadoop Core Components
Core: A set of shared libraries
HDFS: The Hadoop filesystem
MapReduce: Parallel computation framework
Flume: Collection and import of log and event data
Sqoop: Imports data from relational databases
ZooKeeper: Configuration management and coordination
HBase: Column-oriented database on HDFS
Hive: Data warehouse on HDFS with SQL-like access
Pig: Higher-level programming language for Hadoop computations
Oozie: Orchestration and workflow management
Impala: Real-time querying tool
Mahout: A library of machine learning and data mining algorithms

Basic Security
- Network separation
- Authentication
- Permissions
- Authorization management solution
- Encryption

Efficient Security
- Data categorization
- Data masking
- Tokenization
- Do not send sensitive data to HDFS if it is not required
- Use workflow: separate sensitive data into another cluster
- Monitor the Hadoop eco system; deploy SIEM-model monitoring

Bigdata: Security Based on Data and Work Flow
- Identify channels and data sources
- Identify data content
- Introduce or extend data classification to big data
- Identify the workflow
- Select access methods
- Select encryption methods
- Select the analytics tool
- Define the archive policy
- Define the purge and retention policy

Must-Have Features for Security Modules/Architecture
- Key manager
- No impact to performance
- HSM integration and support
- Compliance support
- Easy to administer and migrate

Data Categorization
Data categorization is a well-known concept used to implement different levels of security based on the data. For big data, categorization needs to be extended across the complete data flow, from entry to end (purge). Implement multiple big data clusters based on data category.

Data Classification
- Super Sensitive: DOB, SSN, IP, design
- Sensitive: Account, address, balance, etc.
- Confidential / Private: Company business information, vendor information
- Public: News releases, public finance data

Bigdata: Data LifeCycle
[Diagram: data from each source channel (Channels 1-8) enters the cluster through a per-channel encryption method (e.g., e4) and access method (e.g., a4); each channel's data is then analyzed and governed by its own purge/retention policy (e.g., pr4) and archive policy (e.g., ar4).]

References and Acknowledgments
- Cloudera
- Project Rhino by Intel (open source)
- ZettaSet
- Apache projects
- Hadoop Illuminated
- IBM
- Yahoo
- Oracle
- Hortonworks
- And many more

Questions
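The per-channel lifecycle described in the deck (each channel carrying its own encryption method, access method, purge/retention policy, and archive policy) can be sketched as a small policy registry. Everything below is a hypothetical illustration: the channel names, policy fields, and values are assumptions for the example, not part of the original material.

```python
from dataclasses import dataclass

@dataclass
class ChannelPolicy:
    """Hypothetical per-channel security/lifecycle policy record."""
    encryption: str      # encryption method applied on ingest
    access: str          # access method required of readers
    retention_days: int  # purge/retention window
    archive: str         # archive destination/policy

# Illustrative registry keyed by data source channel.
POLICIES = {
    "channel-1": ChannelPolicy("aes-256", "kerberos+sentry", 365, "cold-hdfs"),
    "channel-2": ChannelPolicy("tokenize", "knox-gateway", 90, "none"),
}

def policy_for(channel: str) -> ChannelPolicy:
    """Look up the lifecycle policy a feed must satisfy before entering HDFS."""
    return POLICIES[channel]

p = policy_for("channel-2")
assert p.retention_days == 90
```

Keeping these decisions in one registry makes the classification-to-purge flow auditable: ingest jobs, access gateways, and retention jobs all consult the same per-channel record instead of hard-coding security choices.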