Mihai Pintea

Agenda
● What is Big Data
● Hadoop and MongoDB
● DataDirect driver

What is Big Data?
Big data refers to data sets so large or complex that traditional data processing applications are inadequate. It has implications for everyone and is transforming the way we do business: our activities leave a digital trace that we can use and analyze, letting us make use of the increasing volumes of data.

How Do We Generate Big Data?
● Conversation data
● Activity data
● Photo and video data
● Sensor data
● Internet of Things data

What Are the Big Data Characteristics? The 4 V's of Big Data
● Volume: quantity of data
● Variety: categories of data
● Velocity: speed of generating data
● Veracity: quality of data

How to Turn Big Data into Value?
The 'datafication' of our world (Volume, Variety, Velocity, Veracity): activities, conversations, words, voice, social media, browser logs, photos, videos, sensors, and more.
Analyzing that data with text analytics, sentiment analysis, face recognition, voice analytics, movement analytics and similar techniques is what turns Big Data into VALUE.

DataDirect Connectivity for Big Data
● Apache Hadoop Hive Data Solutions: rapidly integrate Hadoop Hive with your cloud and on-premise applications, databases, files and social media sources
● Apache Cassandra Data Solutions: improved business performance and scalability for integrating with Apache Cassandra managed systems
● Amazon Redshift Data Solutions: data access to Amazon's fast and powerful data warehouse service in the AWS cloud
● MongoDB Data Solutions: streamlines access to database and file-system data and makes it easier to get data in and out of other reporting and big data applications
● SAP HANA Data Solutions: connectivity to SAP HANA to ease integration of in-memory operational data

What is Hadoop?
An open-source software framework, written in Java, designed for storing and processing large volumes of data: it provides distributed storage and distributed processing of very large data sets on computer clusters. The core of Apache Hadoop consists of a storage part, the Hadoop Distributed File System (HDFS), and a processing part, MapReduce. The base Apache Hadoop framework consists of the following modules: Hadoop Common, HDFS, Hadoop YARN and Hadoop MapReduce.

Properties of a Hadoop System
● HDFS provides a write-once-read-many, append-only access model for data.
● HDFS is optimized for sequential reads of large files (64 MB or 128 MB blocks by default).
● HDFS maintains multiple copies of the data for fault tolerance.
● HDFS is designed for high throughput rather than low latency.
● HDFS is not schema-based; data of any type can be stored.
● Hadoop jobs define a schema for reading the data within the scope of the job.
● Hadoop does not use indexes; data is scanned for each query.
● Hadoop jobs tend to execute over several minutes or longer.
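To make the MapReduce processing model concrete, here is a minimal in-memory word-count sketch in plain Python. It only illustrates the map and reduce phases; it does not use the Hadoop API, and the function names are invented for the example.

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in an input line
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Reducer: sum the counts for each key (word)
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["big data big clusters", "data clusters"]
intermediate = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(intermediate)
print(counts)  # {'big': 2, 'data': 2, 'clusters': 2}
```

In real Hadoop, the mapper runs in parallel across HDFS blocks on many nodes and the framework shuffles the intermediate pairs to the reducers; the data flow, however, is the same as in this sketch.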
How Organizations Are Using Hadoop
Organizations typically use Hadoop for sophisticated, read-only analytics or high-volume data storage applications such as:
● Risk modeling
● Predictive analytics
● Machine learning
● ETL pipelines
● Customer segmentation
● Active archives

DataDirect Driver for Hadoop
● Access and analyze Hadoop data using familiar SQL-based reporting tools
● Progress DataDirect delivers the fastest performance for connecting to Apache Hive distributions
● Leverages standard ODBC/JDBC relational data access methods

Benefits of the DataDirect Hadoop Driver
● A single driver supports all platforms and all Hadoop distributions out of the box, for easier deployment and ongoing management
● Meets the demands of low-latency, real-time query and analysis with superior throughput, CPU efficiency and memory usage
● Works instantly with popular BI and analytics tools such as Tableau, QlikView and SAP Crystal Reports
● Provides highly secure access with user authentication, support for Hive Kerberos and SSL data encryption
● Ensures reliability and stability with the most complete feature set and full standards compliance
● Fully supports Hive2, with improved concurrency for better scalability

What is MongoDB?
An open-source document database written in C++ that provides high performance, high availability, and automatic scaling.

Document database: a record in MongoDB is a document, a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects.
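As an illustration of the document model, a hypothetical product record is shown below as a Python dictionary, mirroring the field/value structure of a MongoDB document. The field names and values are invented for the example.

```python
import json

# A hypothetical MongoDB-style document: field/value pairs,
# with a nested array and an embedded document, much like JSON.
product = {
    "_id": 101,
    "name": "wireless mouse",
    "price": 19.99,
    "tags": ["electronics", "accessories"],      # nested array
    "supplier": {"name": "Acme", "country": "US"},  # embedded document
}

print(json.dumps(product, indent=2))
```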
High performance: MongoDB provides high-performance data persistence.

High availability: MongoDB's replication facility, called replica sets, provides automatic failover and data redundancy.

Data Model Design of MongoDB
● Embedded data model
● Normalized data model

DataDirect MongoDB Driver
● Available as ODBC and JDBC interfaces
● Supports common RDBMS functionality such as joins
● Deep normalization to any level of nested JSON
● SQL-92 compliant, with industry-leading breadth of SQL coverage

How the MongoDB Driver Works
Progress DataDirect maps complex MongoDB JSON structures, including nested documents and nested arrays, into their most natural relational counterpart: child tables that relate to a primary parent table.

MongoDB with Hadoop in Organizations
● eBay – MongoDB: user data and metadata management for the product catalog. Hadoop: user analysis for personalized search and recommendations.
● Orbitz – MongoDB: management of hotel data and pricing. Hadoop: hotel segmentation to support building search facets.
● Pearson – MongoDB: student identity and access control; content management of course materials. Hadoop: student analytics to create adaptive learning programs.
● Foursquare – MongoDB: user data, check-ins, reviews, venue content management. Hadoop: user analysis, segmentation and personalization.
● Tier 1 investment bank – MongoDB: tick data, quants analysis, reference data distribution. Hadoop: risk modeling, security and fraud detection.

MongoDB vs. Hadoop
● MongoDB works on small subsets of data, for real-time processing (e.g. searching data in real time), with processing times measured in milliseconds.
● Hadoop is used when big amounts of data are involved, for analytical purposes and offline processing (e.g. weather forecasting), with processing times measured in minutes and hours.
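The child-table mapping the MongoDB driver performs can be sketched roughly as follows. This is an illustrative simplification, not the actual DataDirect implementation; the `flatten` helper and the field names are invented for the example.

```python
def flatten(doc, array_field):
    # Split one document into a parent row plus child rows for one
    # nested array, linked by a foreign key -- a rough sketch of the
    # relational mapping idea, not the DataDirect implementation.
    parent = {k: v for k, v in doc.items() if k != array_field}
    children = [{"parent_id": doc["_id"], **item} for item in doc[array_field]]
    return parent, children

order = {
    "_id": 7,
    "customer": "Ana",
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B2", "qty": 1},
    ],
}

parent_row, item_rows = flatten(order, "items")
print(parent_row)  # {'_id': 7, 'customer': 'Ana'}
print(item_rows)   # [{'parent_id': 7, 'sku': 'A1', 'qty': 2},
                   #  {'parent_id': 7, 'sku': 'B2', 'qty': 1}]
```

With this mapping, the parent rows and child rows can be queried and joined with ordinary SQL (`parent_id` playing the role of the foreign key), which is what lets ODBC/JDBC tools treat nested JSON as relational tables.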