
Big Data – Course Summary

Big data is a set of technologies, architectures, tools and procedures that allow an organization to capture, process and analyze, very quickly, large quantities of heterogeneous and changing content, and to extract the relevant information at an accessible cost.
Big data applications: education, manufacturing, security, retail,
 financial services: fraud detection, risk management
 telecommunication: churn prediction, geomapping/marketing, network monitoring
 healthcare: Epidemic early warning & Intensive Care Unit and remote monitoring
Big Data V's:
• Variety: different forms of data sources and different types of data
• Velocity: speed at which data is generated; real-time data from different data sources
• Veracity: uncertainty of the data
• Value: what can we do with this data?
• Volume: data size
• Variability: the context in which the data appears
• Visualization: making the data talk, telling the story of the data
Big Data Challenges
• Storage and transport issues: the quantity of available data has exploded…
• Management issues: resolving issues of access, usage, update…
• Processing issues: consider, for example, processing one exabyte of data
• Quality vs. quantity: how much data is needed to extract good knowledge from it?
• Data ownership: who owns the data?
• Compliance and security: in the health and social media domains, data is accumulated about individuals.
Big Data = Big Data Analytics “using data that was previously ignored because of technology
limitations”
Big data problems: hardware improvements through the years...
• Disk capacity
• RAM memory
• CPU speeds
• Disk latency (speed of reads and writes)
Solution: use multiple processors/disks to solve the same problem by fragmenting it into pieces.


Parallel Data Processing?
Parallel data processing has been with us for a while:
• GRID computing – spreads the processing load
• Distributed workload – hard-to-manage applications, overhead on the developer
• Parallel databases – DB2 DPF, Teradata, Netezza, etc. (they distribute the data)
Distributed computing:
 Multiple computers appear as one super computer,
 communicate with each other by message passing,
 operate together to achieve a common goal
 Challenges: Heterogeneity, Security, Scalability, Concurrency, Fault tolerance,
Transparency
Need to process huge datasets on large clusters of computers
 Very expensive to build reliability into each application
 Nodes fail every day
 The number of nodes in a cluster is not constant
• Need a common infrastructure that is efficient, reliable, easy to use, open source, Apache License. The solution: Hadoop
Hadoop is an open-source software framework for reliable, scalable, distributed computing over massive amounts of data; it hides the underlying system details and complexities from the user.
 Consists of 3 sub projects: MapReduce+ HDFS+ Hadoop Common
 Supported by several Hadoop-related projects: HBase, Zookeeper, Avro
 Meant for heterogeneous commodity hardware
Design principles of Hadoop
New way of storing and processing the data:
• Let the system handle most of the issues automatically (failures, scalability, reduced communication)
• Distribute the data and the processing power to where the data is
• Make parallelism part of the operating system
• Bring processing to the data!
 Optimized to handle
 Massive amounts of data through parallelism
 A variety of data (structured, unstructured, semi-structured)
 Using inexpensive commodity hardware, Relatively inexpensive hardware
 Reliability provided through replication
Hadoop V1 is not for all types of work!
 Not to process transactions (random access)
 Not good when work cannot be parallelized
 Not good for low latency data access
 Not good for processing lots of small files
 Not good for intensive calculations with little data
Apache Hadoop?
• Flexible, enterprise-class support for processing large volumes of data
 Well-suited to batch-oriented, read-intensive applications
 Supports wide variety of data
• Enables applications to work with thousands of nodes and petabytes of data in a highly
parallel, cost effective manner
 CPU + disks = “node”
 Nodes can be combined into clusters
 New nodes can be added as needed without changing (Data formats, How data is
loaded, How jobs are written)
Two key aspects of Hadoop?
• Hadoop Distributed File System = HDFS
 Where Hadoop stores data
 A file system that spans all the nodes in a Hadoop cluster
 It links together the file systems on many local nodes to make them into one big file
system
o Distributed
o Reliable
o Commodity gear
•MapReduce framework
 How Hadoop understands and assigns work to the nodes (machines)
o Parallel programming
o Fault tolerant
Hadoop Distributed File System (HDFS)
• Distributed, scalable, fault tolerant, high throughput
• Data access through MapReduce
• Files split into blocks: 3 replicas of each piece of data by default
• Can create, delete, copy, but NOT update
• Designed for streaming reads, not random access
• Data locality: processing data on or near the physical storage to decrease the transmission of data
HDFS Architecture: Master/slave architecture
• Master (NameNode) is a piece of software, written in Java, that manages the file system namespace and metadata and regulates client access to files.
• Slave (DataNode) manages the storage attached to its node and periodically reports its status to the NameNode. A cluster contains many slaves.
File System Namespace
• Its hierarchy is similar to existing file systems: create, remove, move files
• Changes to the file system namespace or its properties (metadata) are recorded by the NameNode (in the EditLog)
• The replication factor is also recorded by the NameNode (in the EditLog)
• The file system namespace and the mapping of blocks to files are stored in the FsImage
• FsImage and EditLog are the central data structures of HDFS
HDFS – Racks
• A Hadoop cluster is a collection of racks
• A rack is a collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch.
• Network bandwidth between any two nodes in a rack is greater than the bandwidth between nodes in different racks.
Installation types:
– Single-node: simple operations + local testing and debugging
– Multi-node cluster: production-level operation + thousands of nodes
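Once a cluster (or a single-node installation) is running, clients can interact with HDFS programmatically. Below is a minimal sketch assuming the third-party `hdfs` Python package (a WebHDFS client, `pip install hdfs`); the host name, port, user and paths are placeholders, not taken from the course.

```python
# Minimal sketch: interacting with HDFS over WebHDFS using the third-party
# `hdfs` Python package. Hostname, port, user and paths are placeholders.
from hdfs import InsecureClient

# The NameNode's WebHDFS endpoint (HTTP port 9870 on Hadoop 3, 50070 on Hadoop 2).
client = InsecureClient('http://namenode-host:9870', user='hadoop')

# Write a file: the client streams the data to DataNodes chosen by the NameNode.
client.write('/user/hadoop/demo.txt', data=b'hello hdfs\n', overwrite=True)

# List a directory and read the file back.
print(client.list('/user/hadoop'))
with client.read('/user/hadoop/demo.txt') as reader:
    print(reader.read())
```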
HDFS - Blocks
• HDFS is designed to support very large files: each file is split into blocks
o Hadoop default: 64MB
o BigInsights default: 128MB
• Blocks reside on different physical DataNode
• Behind the scenes, 1 HDFS block is supported by multiple operating system blocks
HDFS Replication
 HDFS stores data across multiple nodes
 HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple
nodes
 Files are divided into big blocks and 3 copies are “randomly” distributed across the
cluster
Adding File
1. File is added to NameNode memory and persisted in editlog
2. Data is written in blocks to datanodes
o Datanode starts chained copy to two other datanodes
o If at least one write for each block succeeds, write is successful
• Blocks of data are replicated to multiple nodes (controlled by the replication factor, which can be configured per file)
o Default is 3 replicas
• Common case:
o one replica on a node in the local rack
o another replica on a node in a different rack
o and the last one on a different node in the same rack as the 2nd replica
• This cuts inter-rack network traffic, which improves write performance
NameNode
It holds the metadata for HDFS, such as namespace information, block information, etc.
• When in use, all this information is held in main memory, but it is also persisted on disk.
• It stores information on disk in two different files:
1. fsimage: a snapshot of the filesystem at the time the NameNode started
2. EditLog: the sequence of changes made to the filesystem after the NameNode started
NameNode startup
1. NameNode reads fsimage in memory
2. NameNode applies editlog changes
3. NameNode waits for block reports from the DataNodes
• It exits safe mode when 99% of blocks have at least one copy accounted for
• The NameNode does not persist block locations; it learns them from the DataNodes' block reports
NameNode: problem?
• It is only at the restart of the NameNode that the EditLog is applied to the fsimage to get the latest snapshot of the file system.
• But NameNode restarts are rare in production clusters, which means the EditLog can grow very large on clusters where the NameNode runs for a long period of time.
o The EditLog becomes very large, which is challenging to manage
o A NameNode restart takes a long time because a lot of changes have to be merged
o In the case of a crash, we would lose a huge amount of metadata since the fsimage is very old
Solution? Secondary NameNode
• During operations, the primary NameNode cannot merge the fsimage and the EditLog itself
• Every couple of minutes, the secondary NameNode copies the new EditLog from the primary NameNode, merges the EditLog into the fsimage, and copies the newly merged fsimage back to the primary NameNode
– The secondary NN does not have a complete image: in-flight transactions would be lost
– The primary NameNode needs to merge less during startup
• Was temporarily deprecated because of Namenode HA but has some advantages (less
network traffic, less moving parts)
Secondary NameNode:
• "Not a hot standby" for the NameNode
• Connects to the NameNode every hour
• Housekeeping, backup of the NameNode metadata
• The saved metadata can be used to rebuild a failed NameNode
Managing the Cluster
• Adding a DataNode
• Removing a node (better: add the node to the exclude file and wait until all its blocks have been moved)
• Checking filesystem health (use hadoop fsck)
Key points:
• The entire cluster participates in the file system
• Blocks of a single file are distributed across the cluster
• A given block is typically replicated as well, for resiliency
The MapReduce programming model
1. "Map" step:
• The input is split into pieces
• Worker nodes process the individual pieces in parallel (under the global control of the JobTracker node)
• Each worker node stores its result in its local file system, where a reducer is able to access it
2. "Reduce" step:
• Data is aggregated ("reduced" from the map steps) by worker nodes (under the control of the JobTracker)
• Multiple reduce tasks can parallelize the aggregation
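To make the Map and Reduce steps concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are plain scripts that read stdin and write tab-separated key/value pairs to stdout. The script name and the way it is wired into a job are illustrative, not taken from the course.

```python
#!/usr/bin/env python3
# Word-count sketch in the Hadoop Streaming style (illustrative).
import sys

def mapper():
    # "Map" step: emit (word, 1) for every word of every input line.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # "Reduce" step: input arrives sorted by key, so counts for the same
    # word are adjacent and can be aggregated in one pass.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper()
    else:
        reducer()
```

With Hadoop Streaming, such a script would typically be passed as the job's -mapper and -reducer; between the two steps the framework sorts the mapper output by key.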
HDFS 1.0 has 2 main layers
• Namespace = dirs. + files + blocks
 Supports create, delete, modify and list files or dirs.
operations
• Block Storage
 Block Management
 Supports create/delete/modify/get block location
operations,
 Manages replication and replica placement
 Storage: provides read and write access to blocks
MapReduce Engine: Master / Slave architecture
Single master (JobTracker) controls job execution on multiple slaves (TaskTrackers)
• JobTracker
 Accepts MapReduce jobs submitted by clients
 Pushes map and reduce tasks out to TaskTracker nodes
 Keeps the work as physically close to data as possible
 Monitors tasks and TaskTracker status
• TaskTracker
 Runs map and reduce tasks
 Reports status to JobTracker
 Manages storage and transmission of intermediate output
• Driving principles
o Data is stored across the entire cluster (the distributed file system)
o Programs are brought to the data, not the data to the programs
How does Hadoop run a MapReduce job?
1. Job requests from client applications are received by the JobTracker.
2. The JobTracker consults the NameNode in order to determine the location of the required data.
3. The JobTracker locates TaskTracker nodes that contain the data, or at least are near the data.
4. The job is submitted to the selected TaskTracker.
5. The TaskTracker performs its tasks while being closely monitored by the JobTracker. If a task fails, the JobTracker simply resubmits it to another TaskTracker. However, the JobTracker itself is a single point of failure, meaning that if it fails the whole system goes down.
6. The JobTracker updates its status when the job completes.
7. The client requester can now poll information from the JobTracker.
MapReduce Tasks
• Local execution
o Hadoop will attempt to process each split locally
o If no local Map slot is available, the split's data is moved over the network to a node with a free Map slot
• Number of Map tasks
o It is possible to configure the number of Map and Reduce tasks
o If the file is not splittable there will only be a single Map task
• Number of Reduce tasks
o Normally there are fewer Reduce tasks than Map tasks
o Reduce output is written to HDFS
o If you need a single output file, use one Reduce task
 Redundant Execution
o It is possible to configure redundant execution, i.e. 2 or more Map tasks are
started for each split
 The first Map task for a split that finishes wins.
 In systems with cheap and large numbers of machines, this may
increase performance
 In systems with smaller number of nodes or high quality hardware, it
can decrease overall performance.
• A Hadoop V1 cluster can have only a single HDFS namespace.
• Hadoop dedicates all the DataNode resources to Map and Reduce slots, with no or little room for processing any other workload.
• Hadoop cannot be used for real-time processing: it is designed and developed for massively parallel batch processing.
• JobTracker: overburdened
o CPU: spends a very significant portion of time and effort managing the life cycle of applications
o Network: a single listener thread to communicate with thousands of Map and Reduce jobs
• MapReduce MRv1 – only Map and Reduce tasks: it is not possible to run non-MapReduce Big Data applications on HDFS
HADOOP V1 Vs HADOOP V2
Disadvantages of Hadoop V1:
• JobTracker: overburdened (see above)
• NameNode: no horizontal scalability (single NameNode and single namespace, limited by the NameNode's RAM)
• NameNode: no high availability (the NameNode is a single point of failure => manual recovery using the secondary NameNode is needed in case of failure)
Features | Hadoop 1.x | Hadoop 2.0
HDFS federation | One NameNode and one namespace | Multiple NameNodes and namespaces
NameNode HA | – | HA
YARN – processing control and multi-tenancy | JobTracker, TaskTracker | ResourceManager, NodeManager, App Master, Capacity Scheduler
Hadoop V2 federation:
 Multiple independent Namenodes and Namespace Volumes in a cluster
o Namespace Volume = Namespace + Block Pool
 Block Storage as generic storage service
 Set of blocks for a Namespace Volume is called a Block Pool
 DNs store blocks for all the Namespace Volumes
 Simple design
 Little change to the Namenode, most changes in Datanode, Config and Tools
 Namespace and Block Management remain in Namenode
 Little impact on existing deployments
 Single namenode configuration runs as is
 Datanodes provide storage services for all the namenodes
 Register with all the namenodes
 Send periodic heartbeats and block reports to all the namenodes
 Send block received/deleted for a block pool to corresponding namenode
• HDFS Federation helps HDFS Scale horizontally by:
1) Reducing the load on any single NameNode by using the multiple, independent
NameNodes to manage individual parts of the file system namespace.
2) Providing cross-data centre (non-local) support for HDFS, allowing a cluster
administrator to split the Block Storage outside the local cluster.
• In order to scale the name service horizontally, HDFS federation uses multiple independent NameNodes. The NameNodes are federated, that is, they are independent and do not require coordination with each other.
HDFS-2 HA:
• HDFS-2 adds Namenode High Availability
• Standby Namenode needs filesystem transactions and block locations for fast failover
• Every filesystem modification is logged to at least 3 quorum journal nodes by active
Namenode
– Standby Node applies changes from journal nodes as they occur
– Majority of journal nodes define reality
– Split Brain is avoided by Journalnodes (They will only allow one Namenode to write to
them)
• Datanodes send block locations and heartbeats to both Namenodes
• Memory state of Standby Namenode is very close to Active Namenode
—> Much faster failover than cold start
Hadoop 2:
Problems solved so far:
 Scale: Multiple name nodes - Hadoop Federation
 Name Node failure: Hadoop HA
 Burden on the Job Tracker (JobTracker to perform many activities: Resource
Management, Job Scheduling, Job Monitoring, Re-scheduling Jobs, etc.)
 Solution = YARN
YARN
Hadoop 2.x solved the Hadoop 1.x limitations with a new architecture by:
• Decoupling the MapReduce component responsibilities into different components.
• Introducing the new YARN component for resource management.
There are two main ideas with YARN:
 Provide generic scheduling and resource management.
 This way Hadoop can support more than just MapReduce.
 Provide more efficient scheduling and workload management.
• YARN brings significant performance improvements for some applications, supports additional processing models, and implements a more flexible execution engine.
• YARN is a resource manager that was created by separating the processing engine and
resource management capabilities of MapReduce as it was implemented in Hadoop 1.
• YARN is often called the operating system of Hadoop because it is responsible for
managing and monitoring workloads, maintaining a multi-tenant environment,
implementing security controls, and managing high availability features of Hadoop.
• YARN is designed to allow multiple, diverse user applications to run on a multi-tenant
platform.
• YARN supports multiple processing models in addition to MapReduce
• MapReduce has undergone a complete overhaul with YARN, splitting up the two major
functionalities of JobTracker (resource management and job scheduling/ monitoring)
into separate daemons
• ResourceManager (RM)
o The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system
• ApplicationMaster (AM)
o A framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks
The Hadoop 1.x JobTracker component is divided into two components:
• Resource Manager: manages the resources of the cluster
• Application Master: manages applications such as MapReduce, Spark, etc.
YARN features
• Multi-tenancy
o YARN allows multiple access engines (either open-source or proprietary) to use Hadoop as the common standard for batch, interactive, and real-time engines that can simultaneously access the same data sets
o Multi-tenant data processing improves an enterprise's return on its Hadoop investments.
• Cluster utilization
o YARN's dynamic allocation of cluster resources improves utilization over the more static MapReduce rules used in early versions of Hadoop
• Scalability
o Data center processing power continues to expand rapidly. YARN's ResourceManager focuses exclusively on scheduling and keeps pace as clusters expand to thousands of nodes managing petabytes of data.
• Compatibility
o Existing MapReduce applications developed for Hadoop 1 can run on YARN without any disruption to existing processes that already work
 Reliability and availability
o High availability for the ResourceManager
o An application recovery is performed after the restart of ResourceManager
o The ResourceManager stores information about running applications and
completed tasks in HDFS
o If the ResourceManager is restarted, it recreates the state of applications and
reruns only incomplete tasks
o Highly available NameNode, making the Hadoop cluster much more efficient,
powerful, and reliable
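As an illustration of this multi-tenant, centrally managed ResourceManager, here is a hedged sketch that queries the RM's REST API for running applications; the host, the port (8088 is the usual default) and the exact response fields should be checked against your Hadoop version.

```python
# Sketch: list running YARN applications via the ResourceManager REST API.
# Host/port and response fields are assumptions to verify on your cluster.
import requests

rm = "http://resourcemanager-host:8088"
resp = requests.get(f"{rm}/ws/v1/cluster/apps", params={"states": "RUNNING"})
resp.raise_for_status()

apps = (resp.json().get("apps") or {}).get("app", [])
for app in apps:
    # Each entry describes one YARN application (MapReduce, Spark, ...).
    print(app["id"], app["name"], app["applicationType"], app["state"])
```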
(Figure: MapReduce V1 architecture)
(Figure: YARN V2 HA architecture)
Hadoop 2 Limitations/disadvantages
• Minimum runtime is Java 7
• Replication is very costly
o The 3x replication scheme leads to 200% additional storage space and resource overhead.
• Support for only 2 NameNodes (active NameNode + standby NameNode)
o does not provide the maximum level of fault tolerance
• Shell scripts are difficult to understand
o Hadoop developers have to read almost all the shell scripts to understand which environment variable to set for an option and how to set it, whether it is java.library.path or the Java classpath.
Hadoop V2 Vs Hadoop V3
(Figures: How YARN runs applications – summary)
Hadoop 3
1. Erasure Coding
• In Hadoop V2, the default replication factor is 3:
• Every piece of data is replicated twice more to ensure a reliability of 99.999%.
e.g., 6 blocks will consume 6*3 = 18 blocks of disk space.
• Replicating the data blocks to 3 DataNodes incurs 200% additional storage overhead and network bandwidth when writing data.
• Solution: Erasure Coding (EC)
• RAID implements EC through striping:
1) the logically sequential data (such as a file) is divided into smaller units (such as bit, byte, or block) and consecutive units are stored on different disks.
2) for each stripe of original data cells, a certain number of parity cells are calculated and stored. This process is called encoding.
3) the error on any striping cell can be recovered through a decoding calculation based on the surviving data cells and parity cells.
• Integrating EC with HDFS maintains the same fault tolerance with improved storage efficiency.
• e.g., a (6 data, 3 parity) deployment will only consume 9 blocks (6 data blocks + 3 parity blocks) of disk space —> the storage overhead drops to 50%.
• Erasure codes are also known as forward error correction (FEC) codes.
• Erasure coding (EC) is a method of data protection in which data is broken into fragments, expanded and encoded with redundant data pieces, and stored across a set of different locations or storage media.
• Erasure codes are often used instead of traditional RAID because of their ability to reduce the time and overhead required to reconstruct data.
• The drawback of erasure coding is that it can be more CPU-intensive, and that can translate into increased latency.
• Erasure coding can be useful with large quantities of data and any applications or systems that need to tolerate failures, such as data grids, distributed storage applications…
Methodology:
 Erasure coding creates a mathematical function to describe a set of numbers so they
can be checked for accuracy and recovered if one is lost.
 Referred to as polynomial interpolation or oversampling, this is the key concept behind
erasure codes.
 In mathematical terms, the protection offered by erasure coding can be represented in
simple form by the following equation: n = k + m.
o The variable “k” is the original amount of data or symbols.
o The variable “m” stands for the extra or redundant symbols that are
added to provide protection from failures, also called parity blocks.
o The variable “n” is the total number of symbols created after the erasure
coding process.
 Ex: in a 10 of 16 configuration, or EC 10/16, six extra symbols (m) would be added
to the 10 base symbols (k). The 16 data fragments (n) would be spread across 16
drives, nodes or geographic locations. The original file could be reconstructed
from 10 verified fragments.
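A toy sketch of the encode/decode idea with a single XOR parity cell (the same idea as RAID-5). HDFS itself uses Reed–Solomon codes such as RS(6,3), so this illustrates the principle, not the actual HDFS implementation; the cell contents are made up.

```python
# Toy erasure coding: k = 3 data cells, m = 1 XOR parity cell, n = k + m = 4.
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = reduce(xor, data)              # encoding: compute the parity cell

# Simulate the loss of one cell and rebuild it from the surviving cells.
lost_index = 1
survivors = [c for i, c in enumerate(data) if i != lost_index] + [parity]
recovered = reduce(xor, survivors)      # decoding
assert recovered == data[lost_index]

# Storage overhead here: 1 extra cell for 3 data cells (~33%), versus 200%
# for 3-way replication; the RS(6,3) scheme cited above gives 3/6 = 50%.
```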
HDFS Erasure Encoding: Architecture
• NameNode Extensions – The HDFS files are striped into block groups, which have a certain number of internal blocks. To reduce the NameNode memory consumption caused by these additional blocks, a new hierarchical block naming protocol was introduced: the ID of a block group can be deduced from the ID of any of its internal blocks. This allows management at the level of the block group rather than the block.
• DataNode Extensions – The DataNode runs an additional ErasureCodingWorker (ECWorker) task for background recovery of failed erasure-coded blocks. Failed EC blocks are detected by the NameNode, which then chooses a DataNode to do the recovery work.
2. YARN Timeline Service v.2
YARN Timeline Service v.2 was developed to address two major challenges:
• Improving the scalability and reliability of the Timeline Service
o YARN Timeline Service version 1 is limited to a single instance of writer/reader and does not scale well beyond small clusters. Version 2 uses a more scalable distributed writer architecture and a scalable backend storage.
• Enhancing usability by introducing flows and aggregation
o In many cases, users are interested in information at the level of "flows", or logical groups of YARN applications: it is much more common to launch a set or series of YARN applications to complete a logical application. Timeline Service v.2 supports the notion of flows explicitly.
o It also supports aggregating metrics at the flow level.
3. Support for More than 2 NameNodes
In Hadoop 2.x:
• The HDFS NameNode HA architecture has a single active NameNode and a single standby NameNode. By replicating edits to a quorum of three JournalNodes, this architecture is able to tolerate the failure of any one NameNode.
• Business-critical deployments require higher degrees of fault tolerance.
In Hadoop 3:
• Users can run multiple standby NameNodes.
• e.g., by configuring three NameNodes (1 active and 2 passive) and five JournalNodes, the cluster can tolerate the failure of two nodes.
Hortonworks Data Platform, HDP ?
 100% open source framework
 For distributed storage and processing of large, multi-source data sets
 Centrally architected with YARN at its core
 Interoperable with existing technology and skills,
 Enterprise-ready, with data services for operations, governance and security
Sqoop
• It is a tool to easily import information from structured databases (MySQL, Oracle…) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster
• It can also be used to extract data from Hadoop and export it to relational databases and enterprise data warehouses
• It helps offload tasks such as ETL from the enterprise data warehouse to Hadoop, for lower cost and efficient execution
Flume
 Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming event data.
 Flume helps you aggregate data from many sources, manipulate the data, and then
add the data into your Hadoop environment.
 Its functionality is now superseded by HDF / Apache Nifi.
Kafka
• Apache Kafka is a messaging system used for real-time data pipelines.
• It is used to build real-time streaming data pipelines that move data between systems or applications.
• It works with a variety of Hadoop tools for various applications.
Examples of use cases are:
o Website activity tracking: capturing user site activities for real-time tracking/
monitoring
o Log aggregation: collecting logs from various sources to a central location for
processing.
o Stream processing: article recommendations based on user activity
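A minimal sketch of such a pipeline using the third-party kafka-python package (an assumption, not part of the course material); the broker address and topic name are placeholders.

```python
# Producer/consumer sketch with kafka-python (pip install kafka-python).
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "site-activity"

# Producer side: e.g. a web front end publishing user-activity events.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send(TOPIC, b'{"user": "u42", "action": "page_view"}')
producer.flush()

# Consumer side: e.g. a monitoring job reading the same stream.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5 s without messages
)
for message in consumer:
    print(message.offset, message.value)
```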
Hive
 Apache Hive is a data warehouse system built on top of Hadoop.
 Hive facilitates easy data summarization, ad-hoc queries, and the analysis of very large
datasets that are stored in Hadoop.
 Hive provides SQL on Hadoop: SQL interface, better known as HiveQL or HQL, which
allows for easy querying of data in Hadoop
 Includes HCatalog: Global metadata management layer that exposes Hive table
metadata to other Hadoop applications.
Pig
 Apache Pig is a platform for analyzing large data sets.
 Pig was designed for scripting a long series of data operations (good for ETL)
 Pig consists of a high-level language called Pig Latin, which was designed to simplify
MapReduce programming.
 Pig's infrastructure layer consists of a compiler that produces sequences of
MapReduce programs from this Pig Latin code that you write.
 The system is able to optimize your code, and "translate" it into MapReduce allowing
you to focus on semantics rather than efficiency.
HBase
 Apache HBase is a distributed, scalable, big data store.
 Use Apache HBase when you need random, real-time read/write access to your Big
Data.
• The goal of the HBase project is to be able to handle very large tables of data running on clusters of commodity hardware.
 HBase is modeled after Google's BigTable and provides BigTable-like capabilities on top
of Hadoop and HDFS.
 HBase is a NoSQL datastore.
 HBase is not designed for transactional processing.
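A minimal sketch of random, real-time read/write access using the third-party happybase package, which talks to the HBase Thrift server (an assumption; host, table and column-family names are placeholders).

```python
# Random read/write against HBase via Thrift, using happybase (pip install happybase).
import happybase

connection = happybase.Connection("hbase-thrift-host", port=9090)
table = connection.table("users")

# Write: column names are "family:qualifier", values are raw bytes.
table.put(b"user42", {b"info:name": b"Alice", b"info:city": b"Tunis"})

# Random read of a single row by key -- the access pattern HBase is built for.
row = table.row(b"user42")
print(row[b"info:name"])
```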
Accumulo
• Apache Accumulo is a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
• It is based on Google's BigTable and runs on YARN ("highly secure HBase")
• Features:
o Server-side programming
o Designed to scale
o Cell-based access control
o Stable
Phoenix
 Apache Phoenix enables OLTP and operational analytics in Hadoop for low latency
applications by combining the best of both worlds:
o The power of standard SQL and JDBC APIs with full ACID transaction
capabilities.
o The flexibility of late-bound, schema-on-read capabilities from the NoSQL
world by leveraging HBase as its backing store.
 Essentially this is SQL for NoSQL
 Fully integrated with other Hadoop products such as Spark, Hive, Pig, Flume, and
MapReduce
Storm
 Apache Storm is an open source distributed real-time computation system.
o Fast
o Scalable
o Fault-tolerant
 Used to process large volumes of high-velocity data
 Useful when milliseconds of latency matter and Spark isn't fast enough
o Has been benchmarked at over a million tuples processed per second per
node
Solr
 Apache Solr is a fast, open source enterprise search platform built on the Apache
Lucene Java search library
 Full-text indexing and search
o REST-like HTTP/XML and JSON APIs make it easy to use with variety of
programming languages
 Highly reliable, scalable and fault tolerant, providing distributed indexing, replication
and load-balanced querying, automated failover and recovery, centralized
configuration and more
Spark
 Apache Spark is a fast and general engine for large-scale in-memory data processing.
• It has a number of built-in libraries that sit on top of the Spark core and take advantage of all its capabilities: Spark ML, GraphX, Spark Streaming, Spark SQL and DataFrames.
• Spark has a variety of advantages, including:
o Speed: runs programs faster than MapReduce
o Ease of use: write apps quickly in Java, Scala, Python, R
o Generality: can combine SQL, streaming, and complex analytics
• Runs in a variety of environments and can access diverse data sources:
o Hadoop, Mesos, standalone, cloud...
o HDFS, HBase, …
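A minimal PySpark sketch of the DataFrame/SQL API described above; the input path is a placeholder and PySpark must be available locally or on the cluster.

```python
# In-memory word count with the PySpark DataFrame API (input path is illustrative).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

# Read a text file (HDFS, S3, local FS, ...) and count word frequencies.
lines = spark.read.text("hdfs:///user/hadoop/demo.txt")
words = lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
counts = words.groupBy("word").count().orderBy(F.desc("count"))
counts.show(10)

spark.stop()
```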
Druid
 Apache Druid is a high-performance, column-oriented, distributed data store.
 It has a unique architecture that enables rapid multi-dimensional filtering, adhoc
attribute groupings, and extremely fast aggregations
 It supports real-time streams
o Lock-free ingestion to allow for simultaneous ingestion and querying of high
dimensional, high volume data sets
o Explore events immediately after they occur
 It is a datastore designed for business intelligence (OLAP) queries.
 It integrates with Apache Hive to build OLAP cubes and run sub-seconds queries.
Falcon
• Framework for managing the data life cycle in Hadoop clusters
• It is a data governance engine
o Defines, schedules, and monitors data management policies
• It addresses enterprise challenges related to Hadoop data replication, business continuity, and lineage tracing by deploying a framework for data management and processing
Atlas
 Apache Atlas is a scalable and extensible set of core foundational governance services
o It enables enterprises to effectively and efficiently meet their compliance
requirements within Hadoop
• It exchanges metadata with other tools and processes within and outside of the Hadoop stack
o This allows integration with the whole enterprise data ecosystem
 Atlas Features:
o Data Classification
o Centralized Auditing
o Centralized Lineage
o Security & Policy Engine
Ranger
• Centralized security framework to enable, monitor and manage comprehensive data security across the Hadoop platform
• Manages fine-grained access control over Hadoop data access components like Apache Hive and Apache HBase
• The Ranger console can manage policies for access to files, folders, databases, tables, or columns with ease
• Policies can be set for individual users or groups
Ambari
 For provisioning, managing, and monitoring Apache Hadoop clusters.
 Provides intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs
 Ambari REST APIs allows application developers and system integrators to easily
integrate Hadoop provisioning, management, and monitoring capabilities to their own
applications
Cloudbreak
 A tool for provisioning and managing Apache Hadoop clusters in the cloud
 Policy-based autoscaling on the major cloud infrastructure platforms, including:
o Microsoft Azure
o Amazon Web Services
o Google Cloud Platform
o OpenStack
Zookeeper
• Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
• All of these kinds of services are used in some form or another by distributed applications
o It saves time, so you don't have to develop your own
• It is fast, reliable, simple and ordered
• Distributed applications can use ZooKeeper to store and mediate updates to important configuration information
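A minimal sketch of storing and reading a piece of shared configuration, using the third-party kazoo client (an assumption; the host and znode paths are placeholders).

```python
# Store/read shared configuration in ZooKeeper with kazoo (pip install kazoo).
from kazoo.client import KazooClient

zk = KazooClient(hosts="zookeeper-host:2181")
zk.start()

# Create a znode holding configuration data (ensure_path creates parents).
zk.ensure_path("/app/config")
if not zk.exists("/app/config/batch_size"):
    zk.create("/app/config/batch_size", b"128")

value, stat = zk.get("/app/config/batch_size")
print(value, stat.version)

zk.stop()
```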
Zeppelin
 Apache Zeppelin is a Web-based notebook that enables data-driven, interactive data
analytics and collaborative documents
 Documents can contain SparkSQL, SQL, Scala, Python, JDBC connection, and much
more
 Easy for both end-users and data scientists to work with
 Notebooks combine code samples, source data, descriptive markup, result sets, and
rich visualizations in one place
Data mining Vs Big Data
Mining useful information remains difficult for some real-world applications → hence distributed data mining or machine learning.
Two important differences between data mining and big data:
• Negative side: methods for big data analytics are not yet mature
• Positive side: some argue that the almost unlimited data makes it easier to mine information
Advantages of distributed Data Analytics
• Parallel data loading
o Reading several TB of data from disk is slow
o Using 100 machines, each holding 1/100 of the data on its local disk ⇒ 1/100 of the loading time
o But getting the data onto these 100 machines in the first place is another issue
• Fault tolerance
o Some data is replicated across machines: if one fails, the others are still available
o If the data is already stored in a distributed way, it is not convenient to move it all back to one machine for analysis
Disadvantages of distributed Data Analytics
• More complicated communication and synchronization
• Everybody says "move computation to the data", but this isn't that easy!
Distributed environment
• Many tasks that are easy on one computer become difficult in a distributed environment
For example, subsampling is easy on one machine, but may not be in a distributed system
• Usually, the problem is attributed to slow communication between machines
Challenges: big data, small analysis Vs big data, big analysis
• If you need a single record from a huge set, it is reasonably easy
For example, accessing your high-speed rail reservation is fast
• If you want to analyze the whole set by accessing the data several times, it can be much harder
• Most existing data mining/machine learning methods were designed without considering data access and communication of intermediate results
They iteratively use the data by assuming it is readily available
Example: doing least-squares regression is not easy in a distributed environment (see the sketch below)
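To make the least-squares example concrete, here is a hedged sketch of one standard workaround: each machine computes only the small partial sums X_iᵀX_i and X_iᵀy_i, and only those (d×d and d×1 matrices) are communicated, not the data itself. The partitions are simulated locally with numpy; nothing here is tied to a particular framework.

```python
# Distributed least squares via partial sums (partitions simulated locally).
import numpy as np

rng = np.random.default_rng(0)
d = 5
partitions = [(rng.normal(size=(1000, d)), rng.normal(size=1000)) for _ in range(4)]

# "Map": each machine computes its local partial sums (small matrices only).
partials = [(X.T @ X, X.T @ y) for X, y in partitions]

# "Reduce": aggregate the small matrices and solve on a single node.
XtX = sum(p[0] for p in partials)
Xty = sum(p[1] for p in partials)
w = np.linalg.solve(XtX, Xty)
print(w)
```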
Algorithms for distributed analysis
 This is an on-going research topic.
 There are two types of approaches
• Parallelize existing (single-machine) algorithms
• Design new algorithms particularly for distributed settings
• There are things in between
Algorithms and systems
 To have technical breakthroughs for big-data analytics, we should know both
algorithms and systems well, and consider them together
 If you are an expert on both topics, everybody wants you now
• Many machine learning Ph.D. students don't know much about systems, but this wasn't the case in the early days of computer science
Design considerations for big data algorithms
• Generally, we have to minimize data access and communication in a distributed environment
• It is possible that method A is better than method B on one computer, but worse than B in a distributed environment
• Example: on one computer, we often do batch rather than online learning
o Online and streaming learning may be more useful for big-data applications
• Example: very often we design synchronous parallel algorithms
o Maybe asynchronous ones are better for big data?
Risks
 Two problems: Technology limits + Applicability limits
Risk: technology limits
• It is possible not to get satisfactory results because of the distributed configuration
• Parallel programming and HPC (high-performance computing) weren't very successful in the early '90s, but there are two differences this time:
o We are using commodity machines
o Data has become the focus
• Every area has its limitations; the degree of success varies
• Let's compare two matrix products:
o Dense matrix products: very successful, as the final outcome (optimized BLAS) is much better than what ordinary users wrote
o Sparse matrix products: not as successful; ordinary code is about as good as what Matlab provides
• For big-data analytics, it is too early to tell
Risk: applicability limits
 What’s the percentage of applications that need big-data analytics?
o Not clear.
o Some think the percentage is small (they think big-data analytics is a hype)
 One main reason is that you can always analyze a random subset on one machine
 Big-data analytics is in its infancy
 It is challenging to develop algorithms and tools in a distributed environment
 We should take both algorithms and systems into consideration
Text mining
Data mining: "the exploration and analysis, by automatic or semi-automatic means, of a large volume of data in order to discover trends or rules" (M. Berry).
Text mining is the set of data mining techniques that handle a particular kind of data: textual data.
Text mining is the process of extracting unknown, valid and potentially actionable structures (knowledge) from textual documents, through statistical or machine learning techniques.
Text mining is the set of technologies and methods for the automatic processing of textual data available in digital form, in fairly large quantities, in order to extract and structure its content for fast analysis, discovery of hidden information, or automatic decision making.
Motivation behind text mining
• the data explosion: a huge mass of textual data
• the storage and computing capacities offered by modern hardware and computing techniques
• research in Artificial Intelligence and learning theory
Natural language processing (NLP) techniques
• NLP based on linguistics and semantics:
o generative grammars
o semantic networks
o representation of meaning through schemas
o large volume of data
o a lot of noise in the data
o requires strong expertise
• NLP based on statistics and machine learning:
o use representations that are useful but poor
o compensate with volume
Text mining is an approach based on statistics and machine learning.
• Machine learning expects data in table/matrix ("attribute-value") form.
• Challenge: transform the text into a data table/matrix suitable for processing by machine learning algorithms, while minimizing the loss of information.
Vector representation
1. Text cleaning
2. Text normalization
3. Tokenization: split a character string into a set of tokens
4. Indexing:
a. bag of words: does not take into account the sequence in which the words appear in a document + language independent
b. word n-grams: a sequence of n adjacent terms (consecutive, or appearing within a small window) extracted as an index; the idea is that the association of terms carries a meaning different from the one they convey individually.
c. character n-grams: a sequence of n consecutive (contiguous) characters extracted as an index.
5. Dictionary
The set of indexes (i.e. attributes) appearing in the texts of the corpus
6. Weighting
Each document is represented by a vector. Which values should be put in the table?
a. Binary weighting:
• Record the presence of each term in the document, without worrying about the number of occurrences (repetition)
• Advantages
o Simplicity
o A form of "smoothing" of the information, giving the same importance to all terms
o Suited to certain techniques (e.g. association rules) and distance measures (e.g. Jaccard)
• Drawbacks
o Part of the information is not captured (information loss) that certain categories of ML techniques could exploit
o Why give the same importance to all terms?
b. Term frequency (TF):
Count the number of occurrences of each term: an indicator of the importance of the term in the document.
• Advantages
o More information is captured: the repetition of a term within the document is taken into account
o Some techniques know how to exploit this kind of information (matrix computations)
• Drawbacks
o The gaps between documents are exaggerated (e.g. when using a Euclidean distance)
Frequency normalizations make it possible to dampen these gaps and/or to take the length of the documents into account.
c. Inverse document frequency (IDF):
A term present in almost the entire corpus (D) carries little weight when it appears in a document. Conversely, a rare term appearing in a document should draw our attention. IDF measures the importance of a term in the corpus.
d. TF-IDF
Weigh the importance of a term in a document (TF) by its importance in the corpus (IDF).
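The formulas that normally accompany these definitions, in their standard form (the exact normalization used on the course slides may differ slightly):

```latex
% |D| is the number of documents in the corpus D,
% tf(t,d) the number of occurrences of term t in document d.
\mathrm{idf}(t,D) = \log\frac{|D|}{\left|\{\, d \in D : t \in d \,\}\right|}
\qquad
\text{tf-idf}(t,d,D) = \mathrm{tf}(t,d)\times\mathrm{idf}(t,D)
```

And a minimal end-to-end sketch of the pipeline (tokenization, bag of words, TF-IDF weighting) with scikit-learn, on a toy corpus:

```python
# Document-term matrix with TF-IDF weighting (toy corpus for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "big data needs distributed storage",
    "text mining turns text into a document term matrix",
    "tf idf weights terms by document and corpus importance",
]
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(corpus)        # sparse document-term matrix
print(vectorizer.get_feature_names_out())   # the dictionary (index terms)
print(X.toarray().round(2))                 # one TF-IDF vector per document
```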
Vector representation: dimensionality
• The dimensionality is often very high.
• The number of columns can exceed the number of rows.
• This is often a problem for machine learning algorithms.
• Reducing the dimensionality is therefore a crucial issue.
Dimensionality reduction
1. Stopword removal
• A stopword is a word commonly used in a language that carries no meaning in a document (e.g. prepositions, pronouns, etc.).
• Formally, its frequency of appearance is roughly the same in all documents.
• As a result, stopwords cannot discriminate between documents (distinguish documents from one another);
• they are unusable in text mining as we conceive it (categorization, information retrieval, etc.).
2. Lemmatization
• Lemmatization analyses each term in order to identify its canonical form (lemma), which actually exists as a word.
• The idea is to reduce the different forms (plural, feminine, conjugation, etc.) to a single one.
• The technique relies both on a dictionary and on the morphosyntactic analysis of words.
• It is specific to each language.
• Errors are still possible!
3. Stemming
• Stemming reduces a word to its root (stem), which may not exist as a word.
• Porter's algorithm applies a succession of (mechanical) rules to shorten words, i.e. to strip word endings.
• Stemming is a final treatment: it no longer allows any post-processing on the words.
• Stemming can lead to erroneous groupings (e.g. "marmite", "marmaille" → "marm").
4. Frequency filtering
• Frequency: the number of documents in which the term appears at least once, divided by the total number of documents.
• Too high a frequency (terms present in almost all documents): may help identify the domain, but does not help differentiate the documents (e.g. "databases", "image").
• Too low a frequency (terms present in very few documents): does not characterize a significant difference between documents.
• The choice of the thresholds remains arbitrary.
5. Other solutions: spell checker
• The tool relies on a dictionary of correctly spelled terms. If the word being checked is present, it is accepted; if not, the closest words are listed and the closest one(s) are proposed.
• A measure specific to character strings must be used; the best known is the Levenshtein (edit) distance (a minimal sketch follows this list).
• Be careful not to "correct" indiscriminately (e.g. proper nouns).
6. Other solutions: thesaurus
• Some words are synonyms or cover the same concept (e.g. store vs. save).
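As referenced in the spell-checker item above, here is a minimal sketch of the Levenshtein (edit) distance; the toy dictionary and the misspelled word are illustrative only.

```python
# Levenshtein (edit) distance: minimum number of insertions, deletions and
# substitutions needed to turn one string into another.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

# A misspelled word is matched against the closest dictionary entries.
dictionary = ["database", "databases", "image", "storage"]
word = "databse"
print(sorted(dictionary, key=lambda w: levenshtein(word, w))[:2])
```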
Similarity
Similarity measures are needed in many data mining methods (visualization, supervised and unsupervised classification). They characterize the resemblance between objects.
Properties of similarity measures: non-negativity, symmetry, maximality, normalization (slide 71)
A dissimilarity measure characterizes the differences between objects.
• A dissimilarity can be derived from a similarity
• A distance is also a dissimilarity measure
• Euclidean distance
• Cosine similarity and cosine distance: only consider co-occurrences. The normalization makes it possible to compare documents of different lengths.
o A distance can be derived from the cosine similarity
• Jaccard index and Jaccard distance: suited to binary weighting. They consider co-occurrences and also benefit from a normalization mechanism.
o A distance can be derived from the Jaccard index
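A minimal sketch of these two measures on toy term-frequency vectors and binary term sets (numpy only; the vectors are illustrative).

```python
# Cosine and Jaccard similarity/distance on toy document representations.
import numpy as np

def cosine_similarity(u, v):
    # Normalizing by the norms lets documents of different lengths be compared.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def jaccard_index(a: set, b: set):
    # Suited to binary weighting: presence/absence of terms.
    return len(a & b) / len(a | b)

doc1 = np.array([2, 0, 1, 1])   # term-frequency vector of document 1
doc2 = np.array([1, 1, 0, 1])
print(cosine_similarity(doc1, doc2), 1 - cosine_similarity(doc1, doc2))

terms1, terms2 = {"big", "data", "hadoop"}, {"big", "data", "spark"}
print(jaccard_index(terms1, terms2), 1 - jaccard_index(terms1, terms2))
```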