Big Data – Hadoop vs MongoDB

advertisement
Mihai Pintea
Agenda
What is
Big Data
Hadoop
and
MongoDB
DataDirect
driver
2
3
What is Big Data?
Big data = data sets so large or complex that
traditional data processing applications are
inadequate
implications for everyone
transforming the way we do business
digital trace, which we can use and
analyze
make use of the increasing volumes of
data
4
How we generate Big Data?
Conversation Data
Activity Data
Photo and Video
Data
Sensor Data
Internet of Things
Data
5
What are the Big Data characteristics?
Quantity of data
Categories
of data
Variety
Volume
The 4
V’s of
Big Data
Velocity
Speed of
generating
data
Veracity
Quality of data
6
How to turn Big Data into Value ?
The ‘Datafication’
of our World:
• Activities
• Conversations
• Words
• Voice
• Social Media
• Browser Logs
• Photo
• Video
• Sensors
• …
Volume
Velocity
Veracity
Analyzing Big
Data:
• Text Analytics
• Sentiment
Analysis
• Face
Recognition
• Voice Analytics
• Movement
Analytics
• …
VALUE
Variety
7
DataDirect Connectivity for Big Data
Apache Hadoop Hive Data Solutions
Rapidly integrate Hadoop Hive with your cloud and onpremise applications, databases, files and social media
sources.
Apache Cassandra Data Solutions
Provides improved business performance and scalability for
integrating with Apache Cassandra managed systems
Amazon Redshift Data Solutions
Data access to Amazon’s fast and powerful data warehouse
service in the AWS cloud.
MongoDB Data Solutions
Streamlines access to the database and file-system data
and makes it easier to get data in and out of other
reporting and big data applications.
SAP HANA Data Solutions
Connectivity to SAP HANA to ease integration of in-memory
operational data.
8
What is Hadoop?
Software technology designed for storing
and processing large volumes of data
Open-source software framework written in Java for
distributed storage and distributed processing of very
large data sets on computer clusters
The core of Apache Hadoop consists of a storage part
(Hadoop Distributed File System (HDFS)) and a
processing part (MapReduce).
The base Apache Hadoop framework consists of the
following modules: Hadoop Common, Hadoop
Distributed File System (HDFS), Hadoop YARN,
Hadoop MapReduce
9
Properties of a Hadoop System
● HDFS provides a write-once-read-many,
append-only access model for data.
● HDFS is optimized for sequential reads of large
files (64MB or 128MB blocks by default).
● HDFS maintains multiple copies of the data for
fault tolerance.
● HDFS is designed for high-throughput, rather
than low-latency.
● HDFS is not schema-based; data of any type
can be stored.
● Hadoop jobs define a schema for reading the
data within the scope of the job.
● Hadoop does not use indexes. Data is scanned
for each query.
● Hadoop jobs tend to execute over several
minutes or longer.
10
How Organizations Are Using Hadoop
Organizations typically use Hadoop for sophisticated, read-only
analytics or high volume data storage applications such as:
Risk modeling
Predictive
analytics
Machine
learning
ETL pipelines
Customer
segmentation
Active
archives
11
DataDirect driver for Hadoop
● Access and analyze
Hadoop data using
familiar SQL-based
reporting tools
● Progress DataDirect
delivers the fastest
performance for
connecting to Apache
Hive distributions
● Leveraging standard
ODBC / JDBC relational
data access methods
12
Benefits of DataDirect Hadoop Driver
 A single driver supports all platforms and all Hadoop
distributions out-of-the-box—for easier deployment and
ongoing management
 Meets the demands of low latency, real-time query and
analysis with superior throughput, CPU efficiency and
memory usage
 Instantly works with popular BI and analytics tools such
as Tableau, QlikView and SAP Crystal Reports
 Provides highly secure access with user authentication,
support for Hive Kerberos and SSL data encryption
 Ensures reliability and stability with the most complete
feature set and full standards compliance
 Fully supports Hive2 with improved concurrency for
better scalability
13
What is MongoDB?
Open-source document database written
in C++ that provides high performance,
high availability, and automatic scaling
Document Database: A record in MongoDB is
a document, which is a data structure composed of
field and value pairs. MongoDB documents are
similar to JSON objects.
High Performance: MongoDB provides high
performance data persistence
High Availability: To provide high availability,
MongoDB’s replication facility, called replica sets,
provide – automatic failover and data redundancy
14
Data Model Design of MongoDB
Embedded Data Model
Normalized Data Model
15
DataDirect MongoDB Driver
●
●
●
●
Available as ODBC and JDBC interfaces
Support of common RDBMS functionality such as joins
Deep Normalization to any level of nested JSON
SQL-92 compliant with industry-leading breadth of SQL
coverage
16
How MongoDB Driver Works?
Progress DataDirect maps complex MongoDB JSON
structures, including nested documents and nested
arrays into their most natural relational counterpart–
child tables that relate to a primary parent table.
17
MongoDB with Hadoop in Organizations
MONGODB
HADOOP
User data and metadata
management for product catalog
User analysis for personalized
search & recommendations
Orbitz
Management of hotel data and
pricing
Hotel segmentation to support
building search facets
Pearson
Student identity and access control.
Content management of course
materials
Student analytics to create
adaptive learning programs
Foursquare
User data, check-ins, reviews, venue
content management
User analysis, segmentation and
personalization
Tier 1
Investment
Bank
Tick data, quants analysis, reference
data distribution
Risk modeling, security and fraud
detection
eBay
18
- When big amount of
data is involved
- Work on small subsets
of data
- For analytical purpose.
- Processing time
measured in
milliseconds.
- Processing time
measured in minutes
and hours.
- For offline processing
- Eg: Weather
forecasting
- For real time
processing.
- Eg: search data on
real time.
19
20
Download