Big data is a set of technologies, architectures, tools and procedures that allow an organization to capture, process and analyze very large quantities of heterogeneous and changing content very quickly, in order to extract the relevant information at an acceptable cost.

Big data applications:
- Education, manufacturing, security, retail
- Financial services: fraud detection, risk management
- Telecommunications: churn prediction, geomapping/marketing, network monitoring
- Healthcare: epidemic early warning, Intensive Care Unit and remote monitoring

The Vs of Big Data:
- Variety: different data sources and different types of data
- Velocity: speed at which data is generated; real-time data from different sources
- Veracity: uncertainty of the data
- Value: what can we do with this data?
- Volume: size of the data
- Variability: the context of the data
- Visualization: making the data talk, telling the story of the data

Big Data challenges:
- Storage and transport issues: the quantity of available data has exploded
- Management issues: resolving issues of access, usage, update…
- Processing issues: assume that one exabyte of data needs to be processed
- Quality vs. quantity: how much data is needed to extract good knowledge from it?
- Data ownership: who owns the data?
- Compliance and security: in the health and social media domains, data is accumulated about individuals

Big Data = Big Data Analytics: "using data that was previously ignored because of technology limitations".

Big data problems: hardware has improved over the years (disk capacity, RAM, CPU speed), but disk latency (the speed of reads and writes) has improved much more slowly.
Solution: use multiple processors/disks to solve the same problem by fragmenting it into pieces (see the sketch after this section).

Parallel data processing has been with us for a while:
- GRID computing: spreads the processing load
- Distributed workload: hard to manage applications, overhead on the developer
- Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)

Distributed computing: multiple computers appear as one supercomputer, communicate with each other by message passing, and operate together to achieve a common goal.
Challenges: heterogeneity, security, scalability, concurrency, fault tolerance, transparency.

We need to process huge datasets on large clusters of computers:
- It is very expensive to build reliability into each application
- Nodes fail every day
- The number of nodes in a cluster is not constant
- We need a common infrastructure that is efficient, reliable, easy to use, open source (Apache Licence)
Solution? Hadoop

Hadoop is an open-source software framework for reliable, scalable, distributed computing over massive amounts of data. It hides the underlying system details and complexities from the user. It consists of 3 sub-projects: MapReduce + HDFS + Hadoop Common, and is supported by several Hadoop-related projects: HBase, ZooKeeper, Avro. It is meant for heterogeneous commodity hardware.

Design principles of Hadoop: a new way of storing and processing data
- Let the system handle most of the issues automatically (failures, scalability, reduced communication)
- Distribute data and processing power to where the data is
- Make parallelism part of the operating system
- Bring the processing to the data!
Hadoop is optimized to handle:
- Massive amounts of data through parallelism
- A variety of data (structured, unstructured, semi-structured)
- Relatively inexpensive commodity hardware
- Reliability provided through replication
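To make the "fragment the problem into pieces, then aggregate" idea concrete before moving to Hadoop's own model, here is a minimal single-machine sketch in Python (illustrative only; it is not Hadoop code, and the input data and worker count are made up):

    # Minimal sketch of "fragment the problem into pieces": count words in a large
    # list of lines by splitting the work across processes, then combining the
    # partial results -- the same pattern that MapReduce generalizes to a cluster.
    from collections import Counter
    from multiprocessing import Pool

    def count_words(chunk):
        """Count word occurrences in one fragment of the input."""
        c = Counter()
        for line in chunk:
            c.update(line.split())
        return c

    if __name__ == "__main__":
        lines = ["big data needs parallelism", "data is big"] * 100_000
        n = 4                                        # number of worker processes (illustrative)
        chunks = [lines[i::n] for i in range(n)]     # fragment the input into n pieces
        with Pool(n) as pool:
            partials = pool.map(count_words, chunks) # process the fragments in parallel
        total = sum(partials, Counter())             # aggregate the partial counts
        print(total.most_common(3))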
Hadoop V1 is not for all types of work!
- Not for processing transactions (random access)
- Not good when the work cannot be parallelized
- Not good for low-latency data access
- Not good for processing lots of small files
- Not good for intensive calculations on little data

Apache Hadoop?
• Flexible, enterprise-class support for processing large volumes of data
- Well suited to batch-oriented, read-intensive applications
- Supports a wide variety of data
• Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
- CPU + disks = "node"
- Nodes can be combined into clusters
- New nodes can be added as needed without changing data formats, how data is loaded, or how jobs are written

Two key aspects of Hadoop
• Hadoop Distributed File System (HDFS): where Hadoop stores data
- A file system that spans all the nodes in a Hadoop cluster
- It links together the file systems on many local nodes to make them into one big file system
o Distributed
o Reliable
o Runs on commodity gear
• MapReduce framework: how Hadoop understands and assigns work to the nodes (machines)
o Parallel programming
o Fault tolerant

File System Namespace
- Its hierarchy is similar to existing file systems: create, remove, move files
- Changes to the file system namespace or to its properties (metadata) are recorded by the NameNode in the EditLog; the replication factor is also recorded there
- The file system namespace and the mapping of blocks to files are stored in the FsImage
- FsImage and EditLog are central data structures of HDFS

HDFS – Racks
- A Hadoop cluster is a collection of racks
- A rack is a collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch
- Network bandwidth between any two nodes in a rack is greater than the bandwidth between two nodes in different racks

Installation types:
– Single-node: simple operations, local testing and debugging
– Multi-node cluster: production-level operation, up to thousands of nodes

Hadoop Distributed File System (HDFS)
- Distributed, scalable, fault tolerant, high throughput
- Data access through MapReduce
- Files are split into blocks: 3 replicas of each piece of data by default
- Can create, delete, copy, but NOT update
- Designed for streaming reads, not random access
- Data locality: processing data on or near the physical storage to decrease the transmission of data

HDFS Architecture: master/slave architecture
- Master (NameNode): a piece of software written in Java that manages the file system namespace and metadata and regulates client access to files
- Slave (DataNode): manages the storage attached to its node and periodically reports status to the NameNode; a cluster contains many slaves

HDFS – Blocks
• HDFS is designed to support very large files: each file is split into blocks
o Hadoop default: 64 MB
o BigInsights default: 128 MB
• Blocks reside on different physical DataNodes
• Behind the scenes, 1 HDFS block is supported by multiple operating system blocks

HDFS Replication
- HDFS stores data across multiple nodes
- HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes
- Files are divided into big blocks and 3 copies are "randomly" distributed across the cluster
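A quick back-of-the-envelope sketch of what block splitting and replication mean for storage, using the defaults quoted above (64 MB blocks, 3 replicas); the file size is hypothetical:

    # Illustrative arithmetic only: how HDFS block splitting and replication
    # translate into storage, assuming the defaults quoted above.
    import math

    file_size_mb = 1000          # a hypothetical 1000 MB file
    block_size_mb = 64           # Hadoop default block size
    replication = 3              # default replication factor

    blocks = math.ceil(file_size_mb / block_size_mb)   # 16 blocks (the last one only partly full)
    stored_mb = file_size_mb * replication             # 3000 MB actually written across the cluster
    print(blocks, stored_mb)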
Adding a file
1. The file is added to the NameNode memory and persisted in the EditLog
2. The data is written in blocks to the DataNodes
o A DataNode starts a chained copy to two other DataNodes
o If at least one write succeeds for each block, the write is successful

Blocks of data are replicated to multiple nodes (controlled by the replication factor, which can be configured per file)
o Default is 3 replicas
Common case:
o one replica on a node in the local rack
o another replica on a node in a different rack
o and the last on a different node in the same rack as the second replica
This cuts inter-rack network bandwidth, which improves write performance.

NameNode
It holds the metadata for HDFS: namespace information, block information, etc. When in use, all this information is kept in main memory, but it is also persisted to disk. The NameNode stores this information on disk in two different files:
1. fsimage: the snapshot of the filesystem when the NameNode started
2. editlog: the sequence of changes made to the filesystem after the NameNode started

NameNode startup
1. The NameNode reads the fsimage into memory
2. The NameNode applies the editlog changes
3. The NameNode waits for block data from the DataNodes (the NameNode does not store block locations itself)
It exits safe mode when 99% of the blocks have at least one copy accounted for.

NameNode: problem?
The editlogs are applied to the fsimage to obtain the latest snapshot of the file system only when the NameNode restarts. But NameNode restarts are rare in production clusters, which means the editlogs can grow very large on clusters where the NameNode runs for a long period of time:
o The editlog becomes very large and challenging to manage
o A NameNode restart takes a long time because many changes have to be merged
o In the case of a crash, a huge amount of metadata is lost because the fsimage is very old

Solution? Secondary NameNode
• During operation, the primary NameNode cannot merge the fsimage and the editlog itself. Every couple of minutes, the Secondary NameNode copies the new editlog from the primary NameNode, merges it into the fsimage, and copies the new merged fsimage back to the primary NameNode
– The Secondary NameNode does not have a complete image: in-flight transactions would be lost
– The primary NameNode needs to merge less during startup
• It was temporarily deprecated in favour of NameNode HA but has some advantages (less network traffic, fewer moving parts)
Secondary NameNode: "not a hot standby" for the NameNode
- Connects to the NameNode every hour
- Housekeeping, backup of the NameNode metadata
- The saved metadata can be used to rebuild a failed NameNode

Managing the cluster
- Adding a DataNode
- Removing a node (better: add the node to the exclude file and wait until all its blocks have been moved)
- Checking filesystem health (use hadoop fsck)
The entire cluster participates in the file system: the blocks of a single file are distributed across the cluster, and a given block is typically replicated as well for resiliency.

The MapReduce programming model
1. "Map" step:
- The input is split into pieces
- Worker nodes process the individual pieces in parallel (under the global control of the JobTracker node)
- Each worker node stores its result in its local file system, where a reducer is able to access it
2. "Reduce" step:
- Data is aggregated ("reduced" from the map steps) by worker nodes (under the control of the JobTracker)
- Multiple reduce tasks can parallelize the aggregation
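As an illustration of the map and reduce steps above, here is a minimal word count written for Hadoop Streaming, which lets mappers and reducers be plain scripts reading stdin and writing stdout (the file names and the sample data are illustrative, and this is only one common way to write such a job):

    # mapper.py -- emits one "word<TAB>1" pair per word
    import sys
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py -- Hadoop Streaming sorts the mapper output by key before the reducer
    # runs, so identical words arrive consecutively and can be summed with a running counter
    import sys
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

On a cluster, these two scripts would typically be launched through the Hadoop Streaming jar with its -files, -mapper, -reducer, -input and -output options (the exact jar location depends on the installation).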
HDFS 1.0 has 2 main layers
• Namespace = directories + files + blocks
- Supports create, delete, modify and list operations on files or directories
• Block Storage
- Block management: supports create/delete/modify/get-block-location operations, and manages replication and replica placement
- Storage: provides read and write access to blocks

MapReduce Engine: master/slave architecture
A single master (JobTracker) controls job execution on multiple slaves (TaskTrackers).
• JobTracker
- Accepts MapReduce jobs submitted by clients
- Pushes map and reduce tasks out to TaskTracker nodes
- Keeps the work as physically close to the data as possible
- Monitors tasks and TaskTracker status
• TaskTracker
- Runs map and reduce tasks
- Reports status to the JobTracker
- Manages the storage and transmission of intermediate output
• Driving principles
- Data is stored across the entire cluster (the distributed file system)
- Programs are brought to the data, not the data to the programs

How does Hadoop run a MapReduce job?
1. Job requests from client applications are received by the JobTracker.
2. The JobTracker consults the NameNode in order to determine the location of the required data.
3. The JobTracker locates TaskTracker nodes that contain the data, or at least are near the data.
4. The job is submitted to the selected TaskTrackers.
5. Each TaskTracker performs its tasks while being closely monitored by the JobTracker. If a task fails, the JobTracker simply resubmits it to another TaskTracker. However, the JobTracker itself is a single point of failure: if it fails, the whole system goes down.
6. The JobTracker updates its status when the job completes.
7. The client that submitted the job can now poll the information from the JobTracker.

MapReduce tasks
- Local execution
o Hadoop will attempt to execute splits locally
o If no local Map slot is available, the split data is moved over the network to the node where the Map task runs
- Number of Map tasks
o It is possible to configure the number of Map and Reduce tasks
o If a file is not splittable, there will only be a single Map task
- Number of Reduce tasks
o Normally there are fewer Reduce tasks than Map tasks
o Reduce output is written to HDFS
o If you need a single output file, use one Reduce task
- Redundant execution
o It is possible to configure redundant execution, i.e. 2 or more Map tasks are started for each split; the first Map task that finishes for a split wins
o In systems with large numbers of cheap machines, this may increase performance
o In systems with a smaller number of nodes or higher-quality hardware, it can decrease overall performance

Limitations of Hadoop V1:
- A Hadoop cluster can only have a single HDFS namespace.
- Hadoop dedicates all the DataNode resources to Map and Reduce slots, with little or no room for processing any other workload.
- Hadoop cannot be used for real-time processing: it is designed and developed for massively parallel batch processing.
- The JobTracker is overburdened:
o CPU: it spends a very significant portion of its time and effort managing the life cycle of applications
o Network: a single listener thread communicates with thousands of Map and Reduce jobs
- MapReduce MRv1 only knows Map and Reduce tasks: it is not possible to run non-MapReduce Big Data applications on HDFS.

HADOOP V1 vs HADOOP V2
Disadvantages of Hadoop V1:
- JobTracker: overburdened (it spends most of its time managing application life cycles)
- NameNode: no horizontal scalability (single NameNode and single namespace, limited by the NameNode RAM)
- NameNode: no high availability (the NameNode is a single point of failure => manual recovery using the Secondary NameNode is needed in case of failure)
Hadoop 1.x vs Hadoop 2.0 features:
- HDFS federation: Hadoop 1.x has one NameNode and one namespace; Hadoop 2.0 supports multiple NameNodes and namespaces
- NameNode HA: not available in Hadoop 1.x; provided in Hadoop 2.0
- YARN (processing control and multi-tenancy): Hadoop 1.x uses the JobTracker and TaskTrackers; Hadoop 2.0 uses the ResourceManager, NodeManager, Application Master and Capacity Scheduler

Hadoop V2 federation: multiple independent NameNodes and Namespace Volumes in a cluster
o Namespace Volume = Namespace + Block Pool
- Block storage as a generic storage service
o The set of blocks belonging to a Namespace Volume is called a Block Pool
o DataNodes store blocks for all the Namespace Volumes
- Simple design
o Little change to the NameNode; most changes are in the DataNodes, configuration and tools
o Namespace and block management remain in the NameNode
- Little impact on existing deployments
o A single-NameNode configuration runs as is
- DataNodes provide storage services for all the NameNodes
o They register with all the NameNodes
o They send periodic heartbeats and block reports to all the NameNodes
o They send block received/deleted messages for a block pool to the corresponding NameNode
• HDFS federation helps HDFS scale horizontally by:
1) Reducing the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the file system namespace
2) Providing cross-data-centre (non-local) support for HDFS, allowing a cluster administrator to split the block storage outside the local cluster
In order to scale the name service horizontally, HDFS federation uses multiple independent NameNodes. The NameNodes are federated, that is, they are independent and do not require coordination with each other.

HDFS-2 HA:
• HDFS-2 adds NameNode high availability
• The Standby NameNode needs the filesystem transactions and the block locations for fast failover
• Every filesystem modification is logged by the active NameNode to at least 3 quorum JournalNodes
– The Standby NameNode applies the changes from the JournalNodes as they occur
– A majority of the JournalNodes defines reality
– Split brain is avoided by the JournalNodes (they will only allow one NameNode to write to them)
• DataNodes send block locations and heartbeats to both NameNodes
• The memory state of the Standby NameNode is very close to that of the Active NameNode —> much faster failover than a cold start
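A conceptual sketch (not HDFS code) of why "a majority of JournalNodes defines reality" prevents split brain: an edit is considered committed only when a quorum acknowledges it, and two disjoint majorities of the same set of JournalNodes cannot exist, so two NameNodes can never both believe they are active writers.

    # Conceptual sketch only: quorum writes to JournalNodes.
    def quorum(n_journal_nodes: int) -> int:
        """Smallest majority of the JournalNode set."""
        return n_journal_nodes // 2 + 1

    def write_edit(acks: int, n_journal_nodes: int = 3) -> bool:
        """An edit-log write succeeds only if a majority of JournalNodes acknowledged it."""
        return acks >= quorum(n_journal_nodes)

    print(quorum(3))       # 2: with 3 JournalNodes, 2 acknowledgements are required
    print(write_edit(1))   # False: a NameNode cut off from the majority cannot commit edits
    print(write_edit(2))   # True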
Hadoop 2: problems solved so far
- Scale: multiple NameNodes — HDFS federation
- NameNode failure: Hadoop HA
- Burden on the JobTracker (the JobTracker performs many activities: resource management, job scheduling, job monitoring, re-scheduling of jobs, etc.): solution = YARN

YARN
Hadoop 2.x solved the Hadoop 1.x limitations with a new architecture by:
- Decoupling the MapReduce component responsibilities into different components
- Introducing the new YARN component for resource management
There are two main ideas behind YARN:
- Provide generic scheduling and resource management, so that Hadoop can support more than just MapReduce
- Provide more efficient scheduling and workload management
• YARN brings significant performance improvements for some applications, supports additional processing models, and implements a more flexible execution engine.
• YARN is a resource manager that was created by separating the processing engine and the resource management capabilities of MapReduce as it was implemented in Hadoop 1.
• YARN is often called the operating system of Hadoop because it is responsible for managing and monitoring workloads, maintaining a multi-tenant environment, implementing security controls, and managing the high availability features of Hadoop.
• YARN is designed to allow multiple, diverse user applications to run on a multi-tenant platform.
• YARN supports multiple processing models in addition to MapReduce.
• MapReduce has undergone a complete overhaul with YARN, splitting the two major functionalities of the JobTracker (resource management and job scheduling/monitoring) into separate daemons:
- ResourceManager (RM)
o The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system
- ApplicationMaster (AM)
o A framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks
The Hadoop 1.x JobTracker component is thus divided into two components:
- Resource Manager: manages the resources of the cluster
- Application Master: manages applications such as MapReduce, Spark, etc.

YARN features
- Multi-tenancy
o YARN allows multiple access engines (either open-source or proprietary) to use Hadoop as the common standard for batch, interactive, and real-time engines that can simultaneously access the same data sets
o Multi-tenant data processing improves an enterprise's return on its Hadoop investment
- Cluster utilization
o YARN's dynamic allocation of cluster resources improves utilization over the more static MapReduce rules used in early versions of Hadoop
- Scalability
o Data-centre processing power continues to expand rapidly. YARN's ResourceManager focuses exclusively on scheduling and keeps pace as clusters expand to thousands of nodes managing petabytes of data
- Compatibility
o Existing MapReduce applications developed for Hadoop 1 can run on YARN without any disruption to processes that already work
- Reliability and availability
o High availability for the ResourceManager
o Application recovery is performed after a restart of the ResourceManager: the ResourceManager stores information about running applications and completed tasks in HDFS, so if it is restarted it recreates the state of the applications and reruns only the incomplete tasks
o Highly available NameNode, making the Hadoop cluster much more efficient, powerful, and reliable
(Figures in the original slides: MapReduce V1, HA architecture, YARN V2.)

Hadoop 2 limitations/disadvantages
- Minimum runtime version is Java 7
- Replication is very costly
o The 3x replication scheme leads to 200% additional storage space and resource overhead
- Support for only 2 NameNodes (Active NameNode + Standby NameNode)
o This does not provide the maximum level of fault tolerance
- Shell scripts are difficult to understand
o Hadoop developers have to read almost all the shell scripts to understand which environment variable sets an option and how to set it, whether it is java.library.path or the Java classpath
(Figures in the original slides: Hadoop V2 vs Hadoop V3, How YARN runs applications.)

Summary: Hadoop 3
1. Erasure Coding
• In Hadoop V2, the default replication factor is 3: every piece of data is replicated twice to ensure a reliability of 99.999%. For example, 6 blocks will consume 6*3 = 18 blocks of disk space. Replicating the data blocks to 3 DataNodes incurs 200% additional storage overhead and network bandwidth when writing data.
Solution: Erasure coding (EC)
• RAID implements EC through striping:
1) the logically sequential data (such as a file) is divided into smaller units (such as bits, bytes, or blocks) and consecutive units are stored on different disks
2) for each stripe of original data cells, a certain number of parity cells are calculated and stored; this process is called encoding
3) an error on any striping cell can be recovered through a decoding calculation based on the surviving data cells and parity cells
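A minimal, purely illustrative sketch of the encode/decode idea: k data cells plus a single XOR parity cell (m = 1). Real HDFS erasure coding uses Reed-Solomon codes with several parity cells (e.g. 6 data + 3 parity), which can tolerate m losses instead of one, but the principle is the same.

    # Toy erasure code: k data cells and m = 1 parity cell (byte-wise XOR).
    # Any ONE lost cell (data or parity) can be rebuilt from the k surviving cells.
    def encode(data_cells):
        """Return the parity cell: byte-wise XOR of all data cells."""
        parity = bytearray(len(data_cells[0]))
        for cell in data_cells:
            for i, b in enumerate(cell):
                parity[i] ^= b
        return bytes(parity)

    def recover(surviving_cells):
        """Rebuild the single missing cell by XOR-ing every surviving cell (data + parity)."""
        missing = bytearray(len(surviving_cells[0]))
        for cell in surviving_cells:
            for i, b in enumerate(cell):
                missing[i] ^= b
        return bytes(missing)

    data = [b"AAAA", b"BBBB", b"CCCC"]           # k = 3 data cells
    parity = encode(data)                        # m = 1 parity cell
    lost = data[1]                               # pretend cell 1 is lost
    rebuilt = recover([data[0], data[2], parity])
    assert rebuilt == lost                       # the lost cell is recovered

With k = 6 data cells and m = 3 parity cells, the stored volume is (6+3)/6 = 1.5x the original data instead of 3x, which is the ~50% overhead figure quoted in the notes below.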
• Integrating EC with HDFS can maintain the same fault tolerance with improved storage efficiency.
• Example: a (6 data, 3 parity) deployment will only consume 9 blocks (6 data blocks + 3 parity blocks) of disk space —> a storage overhead of only 50%.

Erasure codes are also known as forward error correction (FEC) codes. Erasure coding (EC) is a method of data protection in which data is broken into fragments, expanded and encoded with redundant data pieces, and stored across a set of different locations or storage media. Erasure codes are often used instead of traditional RAID because of their ability to reduce the time and overhead required to reconstruct data. The drawback of erasure coding is that it can be more CPU-intensive, which can translate into increased latency. Erasure coding is useful with large quantities of data and with any applications or systems that need to tolerate failures, such as data grids and distributed storage applications.

Methodology: erasure coding creates a mathematical function to describe a set of numbers so that they can be checked for accuracy and recovered if one is lost. Referred to as polynomial interpolation or oversampling, this is the key concept behind erasure codes. In mathematical terms, the protection offered by erasure coding can be represented in simple form by the equation n = k + m, where:
o k is the original amount of data or symbols
o m stands for the extra or redundant symbols that are added to provide protection from failures, also called parity blocks
o n is the total number of symbols created after the erasure coding process
Example: in a "10 of 16" configuration, or EC 10/16, six extra symbols (m) are added to the 10 base symbols (k). The 16 fragments (n) are spread across 16 drives, nodes or geographic locations, and the original file can be reconstructed from any 10 verified fragments.

HDFS Erasure Coding: architecture
- NameNode extensions: HDFS files are striped into block groups, which have a certain number of internal blocks. To reduce the NameNode memory consumption caused by these additional blocks, a new hierarchical block naming protocol was introduced.
The ID of a block group can be deduced from the ID of any of its internal blocks, which allows management at the level of the block group rather than of the individual block.
- DataNode extensions: the DataNode runs an additional ErasureCodingWorker (ECWorker) task for the background recovery of failed erasure-coded blocks. Failed EC blocks are detected by the NameNode, which then chooses a DataNode to do the recovery work.

2. YARN Timeline Service v.2
The YARN Timeline Service v.2 was developed to address two major challenges:
- Improving the scalability and reliability of the Timeline Service
o Timeline Service version 1 is limited to a single instance of writer/reader and does not scale well beyond small clusters; version 2 uses a more scalable, distributed writer architecture and a scalable backend storage
- Enhancing usability by introducing flows and aggregation
o In many cases, users are interested in information at the level of "flows", i.e. logical groups of YARN applications: it is much more common to launch a set or series of YARN applications to complete a logical application. Timeline Service v.2 supports the notion of flows explicitly
o It also supports aggregating metrics at the flow level

3. Support for more than 2 NameNodes
In Hadoop 2.x, the HDFS NameNode HA architecture has a single active NameNode and a single standby NameNode. By replicating the edits to a quorum of three JournalNodes, this architecture is able to tolerate the failure of any one NameNode. Business-critical deployments, however, require higher degrees of fault tolerance. Hadoop 3 therefore allows users to run multiple standby NameNodes: for example, by configuring three NameNodes (1 active and 2 passive) and five JournalNodes, the cluster can tolerate the failure of two nodes.

Hortonworks Data Platform (HDP)?
- 100% open-source framework
- For distributed storage and processing of large, multi-source data sets
- Centrally architected with YARN at its core
- Interoperable with existing technology and skills
- Enterprise-ready, with data services for operations, governance and security

Sqoop
- A tool to easily import information from structured databases (MySQL, Oracle…) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster
- It can also be used to extract data from Hadoop and export it to relational databases and enterprise data warehouses
- It helps offload tasks such as ETL from the enterprise data warehouse to Hadoop for lower cost and efficient execution

Flume
- Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data
- Flume helps you aggregate data from many sources, manipulate the data, and then add the data into your Hadoop environment
- Its functionality is now superseded by HDF / Apache NiFi

Kafka
- Apache Kafka is a messaging system used for real-time data pipelines
- It is used to build real-time streaming data pipelines that move data between systems or applications
- It works with a variety of Hadoop tools for various applications
- Examples of use cases:
o Website activity tracking: capturing user site activities for real-time tracking/monitoring
o Log aggregation: collecting logs from various sources into a central location for processing
o Stream processing: e.g. article recommendations based on user activity

Hive
- Apache Hive is a data warehouse system built on top of Hadoop
- Hive facilitates easy data summarization, ad-hoc queries, and the analysis of very large datasets stored in Hadoop
- Hive provides SQL on Hadoop: an SQL interface, better known as HiveQL or HQL, which allows easy querying of data in Hadoop
- Includes HCatalog: a global metadata management layer that exposes Hive table metadata to other Hadoop applications

Pig
- Apache Pig is a platform for analyzing large data sets
- Pig was designed for scripting a long series of data operations (good for ETL)
- Pig consists of a high-level language called Pig Latin, which was designed to simplify MapReduce programming
- Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs from the Pig Latin code that you write
- The system is able to optimize your code and "translate" it into MapReduce, allowing you to focus on semantics rather than efficiency

HBase
- Apache HBase is a distributed, scalable, big data store
- Use Apache HBase when you need random, real-time read/write access to your Big Data
- The goal of the HBase project is to handle very large tables of data running on clusters of commodity hardware
- HBase is modeled after Google's BigTable and provides BigTable-like capabilities on top of Hadoop and HDFS
- HBase is a NoSQL datastore
- HBase is not designed for transactional processing

Accumulo
- Apache Accumulo is a sorted, distributed key/value store that provides robust, scalable data storage and retrieval
- It is based on Google's BigTable and runs on YARN ("highly secure HBase")
- Features:
o Server-side programming
o Designed to scale
o Cell-based access control
o Stable

Phoenix
- Apache Phoenix enables OLTP and operational analytics in Hadoop for low-latency applications by combining the best of both worlds:
o the power of standard SQL and JDBC APIs with full ACID transaction capabilities
o the flexibility of late-bound, schema-on-read capabilities from the NoSQL world, by leveraging HBase as its backing store
- Essentially this is SQL for NoSQL
- Fully integrated with other Hadoop products such as Spark, Hive, Pig, Flume, and MapReduce

Storm
- Apache Storm is an open-source distributed real-time computation system:
o Fast
o Scalable
o Fault-tolerant
- Used to process large volumes of high-velocity data
- Useful when milliseconds of latency matter and Spark isn't fast enough
o Has been benchmarked at over a million tuples processed per second per node

Solr
- Apache Solr is a fast, open-source enterprise search platform built on the Apache Lucene Java search library
- Full-text indexing and search
o REST-like HTTP/XML and JSON APIs make it easy to use with a variety of programming languages
- Highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more

Spark
- Apache Spark is a fast and general engine for large-scale in-memory data processing
- It has a number of built-in libraries that sit on top of the Spark core and take advantage of all its capabilities: Spark ML, Spark GraphX, Spark Streaming, Spark SQL and DataFrames
- Spark has a variety of advantages, including:
o Speed: runs programs faster than MapReduce
o Ease of use: write apps quickly in Java, Scala, Python or R
o Generality: can combine SQL, streaming, and complex analytics
- Runs in a variety of environments and can access diverse data sources:
o Hadoop, Mesos, standalone, cloud…
o HDFS, HBase, …
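As a concrete illustration of the Spark programming model mentioned above, a minimal PySpark word count (the input and output paths are illustrative):

    # Minimal PySpark word count; the HDFS paths are illustrative.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
    sc = spark.sparkContext

    counts = (sc.textFile("hdfs:///data/books/*.txt")        # RDD of lines
                .flatMap(lambda line: line.split())           # RDD of words
                .map(lambda word: (word, 1))                  # (word, 1) pairs
                .reduceByKey(lambda a, b: a + b))             # sum the counts per word

    counts.saveAsTextFile("hdfs:///data/wordcount-output")
    spark.stop()

The same few lines express what required two separate mapper and reducer scripts in the Hadoop Streaming example earlier, and intermediate results stay in memory across stages.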
Druid
- Apache Druid is a high-performance, column-oriented, distributed data store
- It has a unique architecture that enables rapid multi-dimensional filtering, ad-hoc attribute groupings, and extremely fast aggregations
- It supports real-time streams:
o Lock-free ingestion allows the simultaneous ingestion and querying of high-dimensional, high-volume data sets
o Events can be explored immediately after they occur
- It is a datastore designed for business intelligence (OLAP) queries; it integrates with Apache Hive to build OLAP cubes and run sub-second queries

Falcon
- A framework for managing the data life cycle in Hadoop clusters
- It is a data governance engine:
o Defines, schedules, and monitors data management policies
- It addresses enterprise challenges related to Hadoop data replication, business continuity, and lineage tracing by deploying a framework for data management and processing

Atlas
- Apache Atlas is a scalable and extensible set of core foundational governance services
o It enables enterprises to effectively and efficiently meet their compliance requirements within Hadoop
- It exchanges metadata with other tools and processes within and outside of Hadoop
o This allows integration with the whole enterprise data ecosystem
- Atlas features:
o Data classification
o Centralized auditing
o Centralized lineage
o Security & policy engine

Ranger
- A centralized security framework to enable, monitor and manage comprehensive data security across the Hadoop platform
- Manages fine-grained access control over Hadoop data access components like Apache Hive and Apache HBase
- The Ranger console can manage policies for access to files, folders, databases, tables, or columns with ease
- Policies can be set for individual users or groups

Ambari
- For provisioning, managing, and monitoring Apache Hadoop clusters
- Provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs
- The Ambari REST APIs allow application developers and system integrators to easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications

Cloudbreak
- A tool for provisioning and managing Apache Hadoop clusters in the cloud
- Policy-based autoscaling on the major cloud infrastructure platforms, including:
o Microsoft Azure
o Amazon Web Services
o Google Cloud Platform
o OpenStack

ZooKeeper
- Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
- All of these kinds of services are used in some form or another by distributed applications
o Using ZooKeeper saves time because you don't have to develop your own
- It is fast, reliable, simple and ordered
- Distributed applications can use ZooKeeper to store and mediate updates to important configuration information

Zeppelin
- Apache Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents
- Documents can contain SparkSQL, SQL, Scala, Python, JDBC connections, and much more
- Easy for both end-users and data scientists to work with
- Notebooks combine code samples, source data, descriptive markup, result sets, and rich visualizations in one place
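To make the ZooKeeper entry above more concrete, a small sketch using the third-party Python client kazoo (the library choice, host address and znode paths are assumptions for illustration, not part of the course):

    # Sketch: storing and reading a piece of shared configuration in ZooKeeper,
    # using the third-party "kazoo" client (illustrative host and paths).
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    zk.ensure_path("/myapp/config")                        # create parent znodes if missing
    if not zk.exists("/myapp/config/db_url"):
        zk.create("/myapp/config/db_url", b"jdbc:mysql://db:3306/sales")

    data, stat = zk.get("/myapp/config/db_url")            # every client sees the same value
    print(data.decode(), stat.version)

    zk.stop()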
Data mining vs Big Data
- Mining useful information remains difficult for some real-world applications → distributed data mining or machine learning
- Two important differences between data mining and big data:
o Negative side: methods for big data analytics are not yet mature
o Positive side: some argue that the almost unlimited data makes it easier to mine information

Advantages of distributed data analytics
- Parallel data loading
o Reading several TB of data from disk is slow
o Using 100 machines, each holding 1/100 of the data on its local disk ⇒ 1/100 of the loading time
o But getting the data onto these 100 machines in the first place is another issue
- Fault tolerance
o Some data is replicated across machines: if one fails, the others are still available
o If the data is already stored in a distributed way, it is not convenient to bring it back to one machine for analysis

Disadvantages of distributed data analytics
- More complicated
- Communication and synchronization: everybody says "move the computation to the data", but this isn't that easy!
- Distributed environment: many tasks that are easy on one computer become difficult in a distributed environment
o For example, subsampling is easy on one machine, but may not be in a distributed system
o Usually, the problem is attributed to slow communication between machines

Challenges: "big data, small analysis" vs "big data, big analysis"
- If you need a single record from a huge set, it is reasonably easy (for example, accessing your high-speed rail reservation is fast)
- If you want to analyze the whole set by accessing the data several times, it can be much harder
- Most existing data mining/machine learning methods were designed without considering data access and the communication of intermediate results: they use data iteratively, assuming it is readily available
o Example: doing least-squares regression isn't easy in a distributed environment

Algorithms for distributed analysis
This is an ongoing research topic. There are two types of approaches:
• Parallelize existing (single-machine) algorithms
• Design new algorithms specifically for distributed settings
• And there are things in between

Algorithms and systems
- To achieve technical breakthroughs in big-data analytics, we should know both algorithms and systems well, and consider them together
- If you are an expert on both topics, everybody wants you now
- Many machine learning Ph.D. students don't know much about systems, but this wasn't the case in the early days of computer science

Design considerations for big-data algorithms
- Generally, we have to minimize data access and communication in a distributed environment
- It is possible that method A is better than B on one computer but worse than B in a distributed environment
o Example: on one computer we often do batch rather than online learning; online and streaming learning may be more useful for big-data applications
o Example: very often we design synchronous parallel algorithms; maybe asynchronous ones are better for big data?
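To illustrate the least-squares example and the "minimize data access and communication" principle above, here is a single-machine sketch of one standard way to organize the computation (an illustrative sketch, not any particular framework's API): each partition ships only the small matrices X^T X and X^T y, never its raw rows.

    # Sketch: least-squares regression over partitioned data.
    # Each partition computes the small summaries X^T X (d x d) and X^T y (d,),
    # so only these small matrices need to be communicated and summed,
    # not the raw rows themselves.
    import numpy as np

    def partial_sums(X_part, y_part):
        """Local work on one partition: return (X^T X, X^T y)."""
        return X_part.T @ X_part, X_part.T @ y_part

    rng = np.random.default_rng(0)
    partitions = [(rng.normal(size=(1000, 5)), rng.normal(size=1000)) for _ in range(4)]

    # In a real cluster each call would run on a different machine ("map"),
    # and only the 5x5 and 5-element results would travel over the network ("reduce").
    XtX = sum(partial_sums(X, y)[0] for X, y in partitions)
    Xty = sum(partial_sums(X, y)[1] for X, y in partitions)

    beta = np.linalg.solve(XtX, Xty)   # same solution as fitting on the full data at once
    print(beta)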
Risks
Two problems: technology limits + applicability limits.

Risk: technology limits
- It is possible not to get satisfactory results because of the distributed configuration
- Parallel programming / HPC (high-performance computing) wasn't very successful in the early 90's, but there are two differences this time:
o We are using commodity machines
o Data has become the focus
- Every area has its limitations; the degree of success varies. Compare two matrix products:
o Dense matrix products: very successful, as the final outcome (optimized BLAS) is much better than what ordinary users would write
o Sparse matrix products: not as successful; ordinary code is about as good as what Matlab provides
- For big data analytics, it is too early to tell

Risk: applicability limits
- What percentage of applications actually needs big-data analytics?
o Not clear
o Some think the percentage is small (they consider big-data analytics a hype); one main reason is that you can always analyze a random subset on one machine

Big-data analytics is in its infancy. It is challenging to develop algorithms and tools in a distributed environment, and we should take both algorithms and systems into consideration.

Text mining
Data mining: "the exploration and analysis, by automatic or semi-automatic means, of a large volume of data in order to discover trends or rules" (M. Berry).
Text mining is the set of data mining techniques for processing a particular kind of data: textual data.
Text mining is the process of extracting unknown, valid and potentially exploitable structures (knowledge) from textual documents, through the use of statistical or machine learning techniques.
Text mining is the set of technologies and methods for the automatic processing of textual data available in electronic form and in fairly large quantities, in order to extract and structure their content for rapid analysis, discovery of hidden information, or automatic decision-making.

Motivation behind text mining
- The data explosion: a huge mass of textual data
- The storage and computing capacities offered by modern hardware and computing techniques
- Research in Artificial Intelligence and in learning theory

Natural language processing (NLP) techniques
- NLP based on linguistics and semantics: generative grammars, semantic networks, representation of meaning by schemas
o Large volumes of data, a lot of noise in the data, and a need for strong expertise
- NLP based on statistics and machine learning: use representations that are useful but poor, and compensate with volume
Text mining is an approach based on statistics and machine learning.
Machine learning expects data in table/matrix ("attribute-value") format. The challenge is therefore to transform the text into a data table/matrix suitable for processing by machine learning algorithms, while minimizing the loss of information as much as possible.

Vector representation
1. Text cleaning
2. Text normalization
3. Tokenization: splitting a character string into a set of tokens
4. Indexing (see the sketch after this list):
a. Bag of words: does not take into account the sequence in which the words appear in a document; language-independent
b. Word n-grams: sequences of n adjacent terms (consecutive, or appearing within a restricted window) extracted as indexes; the idea is that the association of terms carries a meaning different from the one they convey individually
c. Character n-grams: sequences of n consecutive (contiguous) characters extracted as indexes
5. Dictionary: the set of all indexes (i.e. attributes) appearing in the texts of the corpus
6. Weighting: each document is represented by a vector; what values should be put in the table?
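A small sketch of steps 3–4 above on a toy sentence (pure Python; the sentence and the choice of n are illustrative):

    # Toy illustration of tokenization, bag of words and n-gram extraction (steps 3-4).
    from collections import Counter

    text = "big data needs big ideas"

    tokens = text.lower().split()                    # 3. tokenization
    bag_of_words = Counter(tokens)                   # 4.a bag of words (word order is lost)
    word_bigrams = list(zip(tokens, tokens[1:]))     # 4.b word n-grams (here n = 2)
    char_trigrams = [text[i:i+3] for i in range(len(text) - 2)]   # 4.c character n-grams (n = 3)

    print(bag_of_words)       # Counter({'big': 2, 'data': 1, 'needs': 1, 'ideas': 1})
    print(word_bigrams)       # [('big', 'data'), ('data', 'needs'), ('needs', 'big'), ('big', 'ideas')]
    print(char_trigrams[:4])  # ['big', 'ig ', 'g d', ' da']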
a. Binary weighting: record the presence of each term in the document, regardless of its number of occurrences (of repetition).
Advantages:
o Simplicity
o A form of "smoothing" of the information, by giving the same importance to all terms
o Suited to certain techniques (e.g. association rules) and distance measures (e.g. Jaccard)
Drawbacks:
o Part of the information is not captured (loss of information), although some categories of ML techniques could take advantage of it
o Why give the same importance to all terms?

b. Term frequency (TF): count the number of occurrences of each term; an indicator of the importance of the term in the document.
Advantages:
o More information is captured: the repetition of a term in the document is taken into account
o Some techniques know how to exploit this kind of information (matrix computations)
Drawbacks:
o The gaps between documents are exaggerated (e.g. if a Euclidean distance is used); frequency normalizations make it possible to dampen these gaps and/or to take the length of the documents into account

c. Inverse document frequency (IDF): a term present in almost the whole corpus (D) has little influence when it appears in a document; conversely, a rare term appearing in a document should draw our attention. IDF measures the importance of a term in the corpus.

d. TF-IDF: weigh the importance of a term in a document (TF) by its importance in the corpus (IDF), as in the sketch below.

Vector representation: the dimensionality is often very high, and the number of columns can exceed the number of rows, which is often a problem for machine learning algorithms. Reducing the dimensionality is therefore a crucial issue.
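A small TF-IDF sketch on a toy corpus (pure Python; the formula idf(t) = log(N / df(t)) is one common variant, and the course slides may use a different normalization):

    # Toy TF-IDF: tf(t, d) = raw count of t in d, idf(t) = log(N / df(t)),
    # where df(t) is the number of documents containing t. One common variant
    # among many (log-scaled TF, smoothed IDF, etc.).
    import math
    from collections import Counter

    docs = [
        "big data needs big ideas".split(),
        "small data small ideas".split(),
        "big clusters process big data".split(),
    ]
    N = len(docs)

    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # count each term once per document

    def tf_idf(doc):
        tf = Counter(doc)
        return {t: tf[t] * math.log(N / df[t]) for t in tf}

    for doc in docs:
        print(tf_idf(doc))

Note how "data", which appears in every document, gets a weight of 0: a term present in the whole corpus carries no discriminating information, which is exactly the intuition behind point (c) above.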
Dimensionality reduction
1. Stopword removal: a stopword is a word commonly used in a language that does not carry meaning in a document (e.g. prepositions, pronouns, etc.); formally, its frequency of occurrence is roughly the same in all documents. As a result, stopwords do not make it possible to discriminate between documents (to distinguish documents from one another) and are unusable for text mining as we conceive it here (categorization, information retrieval, etc.).
2. Lemmatization: analyzing each term in order to identify its canonical form (lemma), which really exists. The idea is to reduce the different forms (plural, feminine, conjugation, etc.) to a single one. The technique relies both on a dictionary and on the morphosyntactic analysis of words; it is specific to each language, and errors are always possible!
3. Stemming: reducing a word to its root (stem), which may not exist as a real word. Porter's algorithm applies a succession of (mechanical) rules to reduce the length of words, i.e. to remove word endings.
• Stemming is a final treatment, which no longer allows any post-processing on the words
• Stemming can lead to erroneous groupings (e.g. "marmite", "marmaille" → "marm")
4. Frequency filtering: the frequency is the number of documents in which the term appears at least once, divided by the total number of documents.
- A frequency that is too high (terms present in almost all documents) may help identify the domain, but does not make it possible to differentiate the documents (e.g. "databases", "image")
- A frequency that is too low (terms present in very rare documents) does not characterize a significant difference between the documents
- The choice of the thresholds remains arbitrary
5. Other solutions: spell checker. The tool relies on a dictionary in which the terms are correctly spelled. If the word to be evaluated is present, OK; if it is not, the closest words are listed and the closest one(s) are proposed. A measure specific to character strings must be used; the best known is the Levenshtein (edit) distance. Be careful not to "correct" indiscriminately (e.g. proper nouns).
6. Other solutions: thesaurus. Some words are synonyms or cover the same concept (e.g. store vs. save).

Similarity
Similarity measures are needed in many data mining methods (visualization, supervised and unsupervised classification); they characterize the resemblance between objects.
Properties of similarity measures: non-negativity, symmetry, maximality, normalization (slide 71 of the original course).
A dissimilarity measure characterizes the differences between objects; a dissimilarity can be derived from a similarity (typically d = 1 − s for a normalized similarity), and a distance (e.g. the Euclidean distance) is also a dissimilarity measure.
Cosine similarity and cosine distance: only look at co-occurrences; the normalization makes it possible to compare documents of different lengths. A distance can be derived from the cosine similarity (typically 1 − cosine similarity).
Jaccard index and Jaccard distance: suited to binary weighting; look at co-occurrences and also benefit from a normalization mechanism. A distance can be derived from the Jaccard index (typically 1 − Jaccard index).
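A small sketch of the cosine and Jaccard measures on toy document vectors (pure Python; it uses the common "distance = 1 − similarity" conversion mentioned above, which may differ from the exact formulas on the course slides):

    # Cosine similarity on term-frequency vectors and Jaccard index on binary term sets,
    # plus the usual "distance = 1 - similarity" conversions.
    import math

    def cosine_similarity(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    def jaccard_index(set_u, set_v):
        return len(set_u & set_v) / len(set_u | set_v)

    # Two toy TF vectors over a 4-term dictionary ["big", "data", "ideas", "small"]
    d1 = [2, 1, 1, 0]
    d2 = [0, 1, 1, 2]

    cos = cosine_similarity(d1, d2)
    jac = jaccard_index({"big", "data", "ideas"}, {"small", "data", "ideas"})

    print(round(cos, 3), round(1 - cos, 3))   # cosine similarity and cosine distance
    print(round(jac, 3), round(1 - jac, 3))   # Jaccard index and Jaccard distance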