IST — Data Lakes
Infrastructures pour le stockage et le traitement de données | Academic year 2023/24
HEIG-VD | TIC – Technologies de l'Information et de la Communication | © 2022 HEIG-VD

Evolution of data management for analytics
■ 1990's: Data warehouse
  ■ Companies make heavy use of relational database management systems (RDBMS) to support business processes. Tens or hundreds of databases.
  ■ Online transaction processing (OLTP).
  ■ Data warehouse: central repository of key data ingested from the OLTP systems. Used for analytic reports ("Which sales channel had the biggest decline in the last quarter?"). Analytic RDBMS.
  ■ Online analytical processing (OLAP).
  ■ Data is well-integrated, highly structured, highly curated, and highly trusted.
■ 2006: Hadoop and Big Data
  ■ Exponential growth in semi-structured and unstructured data (mobile and web applications, sensors, IoT devices, social media, and audio and video media platforms).
  ■ Data has high velocity, high volume, and high variety.
  ■ Hadoop: open source framework for processing large datasets on clusters of computers.

Evolution of data management for analytics
■ 2011: Data lake
  ■ Companies have lots of data and suspect that it may be valuable, but don't know yet how to extract the value. → Store everything, just in case.
  ■ Data lake: company-wide repository for storing data. Store it in raw form. Don't try to optimise the storage.
  ■ Adoption of public cloud infrastructure, namely cloud object stores.
  ■ Data is unstructured, semi-structured, and structured.
■ 2020: Data lakehouse
  ■ Companies try to integrate the best ideas from data warehouses and data lakes.
  ■ In addition to the raw format, data is stored in an optimised binary format such as Parquet.
  ■ Concurrent reads/writes become possible with the Delta Lake file format.
■ 2021: Data mesh
  ■ The company-wide data lake becomes too big and too complex to manage. Instead, each division of the company creates its own data lake that it manages independently.

Enterprise data warehousing architecture
[Figure: enterprise data warehousing architecture (diagram not reproduced).]

Dimensional modeling in data warehouses
[Figure: two dimensional models side by side: one where the dimension tables are denormalised (a star schema), and one where the dimension tables are normalised, with less duplication (a snowflake schema).]

Extract-Transform-Load pipelines
Feeding data into the warehouse
[Figure: ETL pipelines feeding data from the source systems into the warehouse (diagram not reproduced).]
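The pipeline itself is only shown as a figure in the slides. As a minimal sketch, one ETL step could look like the following in Python; everything here (database files, table and column names, the cents conversion) is hypothetical and for illustration only.

    # Hedged sketch of one ETL step: extract orders from an OLTP database,
    # transform them, and load them into a warehouse fact table.
    import sqlite3

    oltp = sqlite3.connect("oltp.db")        # source: operational (OLTP) system
    dwh = sqlite3.connect("warehouse.db")    # target: analytic data warehouse

    # Extract: pull yesterday's orders from the OLTP system.
    rows = oltp.execute(
        "SELECT order_id, customer_id, amount_cents, order_date FROM orders "
        "WHERE order_date = date('now', '-1 day')").fetchall()

    # Transform: convert amounts from cents to a decimal currency value.
    transformed = [(oid, cid, cents / 100.0, day)
                   for (oid, cid, cents, day) in rows]

    # Load: append the cleaned rows into the warehouse fact table.
    dwh.executemany(
        "INSERT INTO fact_sales (order_id, customer_id, amount, order_date) "
        "VALUES (?, ?, ?, ?)", transformed)
    dwh.commit()

In a real warehouse the load step would typically be batched and scheduled (e.g. nightly), which is exactly what the ETL pipelines in the figure represent.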
Google File System and MapReduce programming model
■ In the beginning of the 2000's, Google develops its search engine, which quickly dominates the market thanks to superior technology.
■ In 2003 Google engineers publish a paper about the Google File System (GFS), the distributed file system that is used to store the data for its search engine.
[Figure: GFS architecture (from the 2003 paper): an application's GFS client sends a file name and chunk index to the GFS master, which keeps the file namespace and replies with chunk handles and chunk locations; the client then exchanges chunk data directly with the GFS chunkservers, which store the chunks in their local Linux file systems. Control messages and data messages take separate paths.]
■ In 2004 they follow with a paper on MapReduce, a new programming model for parallel processing that Google uses for the search engine processing (indexing, ranking, …).
[Figure: MapReduce execution overview (from the 2004 paper): the user program forks a master and workers; the master assigns map and reduce tasks; map workers read the input splits and write intermediate files to their local disks; reduce workers read the intermediate data remotely and write the output files.]
■ These papers were very influential and launched the era of Big Data.
Google File System and MapReduce programming model
■ The Google File System and the MapReduce programming model had several innovations over existing high-performance computing systems:
  ■ The system is built on inexpensive commodity hardware. It scales to hundreds or thousands of machines.
  ■ The programming model is much simpler than the shared memory or message passing models.
[Figure: processes P1 to P5 communicating through shared memory (pthreads) vs. exchanging messages (message passing).]

Apache Hadoop
■ In 2005 the engineers Doug Cutting and Mike Cafarella develop a system similar to Google's.
■ In 2006 they launch the Apache Hadoop open source project.
■ A Hadoop installation consists mainly of:
  ■ a cluster of machines (physical or virtual);
  ■ the distributed file system HDFS (Hadoop Distributed File System);
  ■ the NoSQL database HBase;
  ■ the distributed computing framework MapReduce;
  ■ the data processing applications written by the developer.
■ Hadoop is named after the toy elephant of Cutting's son.

MapReduce programming model
Distribution of data in HDFS
■ When uploading a big file to a MapReduce cluster, the file is distributed over the machines of the cluster.
■ The file system takes care of dividing the file into pieces (chunks of 64 MB) which are managed by different machines of the cluster.
■ This is a form of sharding.
■ Additionally, the chunks are replicated: by default there are three copies of each chunk in the cluster.
[Figure: a big file is divided into pieces, and the pieces are distributed over HDFS nodes 1 to 4.]

MapReduce programming model
Data processing — Main concept
■ One wishes to process a big volume of data which is distributed over several machines.
■ Traditional approach: move the data to the processing.
[Figure: data from nodes 1 to 4 is moved to a central processing step, which produces the result.]
■ Problem:
  ■ Data volumes keep growing faster than the performance of data storage.
  ■ Hard disks have a relatively low read speed (currently ~100 MB/second for magnetic disks, ~550 MB/second for SSDs).
  ■ Reading a copy of the Web (> 400 TB) would need more than a week, even from SSDs!

MapReduce programming model
Data processing — Main concept
■ MapReduce approach: move the processing to the data.
  ■ Each machine which stores data executes a piece of the processing.
  ■ Partial results are collected and aggregated.
[Figure: nodes 1 to 4 each produce a partial result; the partial results are aggregated into the final result.]
■ Advantages:
  ■ Less movement of data over the network.
  ■ Processing takes place in parallel on several machines.
  ■ Processing a copy of the Web using 1'000 machines takes less than 3 hours.
(A toy sketch of this partial-aggregation pattern follows below.)
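A toy simulation of the idea in plain Python: the "nodes" here are just in-memory lists, whereas in a real cluster each partial step would run on the machine that actually stores the data.

    # Toy illustration of "move the processing to the data": each node
    # computes a small partial result locally, and only these partial
    # results (not the data itself) travel to the aggregation step.
    nodes = [
        [12, 7, 31],       # data stored on node 1
        [25, 3],           # data stored on node 2
        [18, 40, 2, 9],    # data stored on node 3
        [11, 6],           # data stored on node 4
    ]

    # Phase 1 (in parallel on each node): compute a partial maximum.
    partial_results = [max(chunk) for chunk in nodes]    # [31, 25, 40, 11]

    # Phase 2 (aggregation): combine the small partial results.
    print(max(partial_results))                          # 40

The key point: the four large chunks stay where they are; only four small integers cross the network.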
MapReduce programming model
Distributed computing platform
■ The MapReduce concept is a simple data processing model that can be applied to many problems:
  ■ Google: compute PageRank, which determines the importance of a Web page.
  ■ Last.fm: compute charts of the most listened songs and recommendations (music you might like).
  ■ Facebook: compute usage statistics (user growth, visited pages, time spent by users) and recommendations (people you may know, applications you might like).
  ■ Rackspace: indexing of infrastructure logs to determine the root cause in case of failures.
  ■ ...
■ To implement the model one needs to:
  ■ parallelize the compute tasks;
  ■ balance the load;
  ■ optimize disk and memory transfers;
  ■ manage the case of a failing machine;
  ■ ...
■ A distributed computing platform is needed!

MapReduce programming model
Map and Reduce functions — Origin of the terms
■ The terms Map and Reduce come from Lisp.
■ When you have a list, you can apply the same function to every element of the list at once. You obtain another list. For example the function x → x².
[Figure: the Map function applied to the input list 1 2 3 4 5 6 7 8 produces the output list 1 4 9 16 25 36 49 64.]
■ You can also apply, at once, a function which reduces the elements of a list to a single value. For example the sum function.
[Figure: the Reduce function applied to the input list 1 2 3 4 5 6 7 8 produces the output value 36.]
■ In Hadoop, the Map and Reduce functions are more general. (A minimal Python illustration of the two Lisp-style operations follows below.)
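Python has the same two operations built in; a minimal illustration of Lisp-style Map and Reduce on the list shown in the figures:

    from functools import reduce

    nums = [1, 2, 3, 4, 5, 6, 7, 8]

    # Map: apply x -> x^2 to every element, obtaining another list.
    squares = list(map(lambda x: x * x, nums))   # [1, 4, 9, 16, 25, 36, 49, 64]

    # Reduce: collapse the list to a single value with the sum function.
    total = reduce(lambda a, b: a + b, nums)     # 36

    print(squares, total)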
MapReduce programming model
Example: Processing of meteorological data
■ The National Climatic Data Center of the United States publishes meteorological data:
  ■ captured by tens of thousands of meteorological stations;
  ■ measures: temperature, humidity, precipitation, wind, visibility, pressure, etc.;
  ■ historical data available since the beginning of meteorological measurements.
■ The data is available as text files. Example file:

    0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
    0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
    0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
    0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
    ...

MapReduce programming model
Example: Processing of meteorological data
■ Each file contains the measures of a year.
■ One line represents a set of observations of a station at a certain point in time.
■ Example line with comments (distributed over several lines for better legibility):

    0057
    332130    # USAF weather station identifier
    99999     # WBAN weather station identifier
    19500101  # observation date
    0300      # observation time
    4
    +51317    # latitude (degrees x 1000)
    +028783   # longitude (degrees x 1000)
    FM-12
    +0171     # elevation (meters)
    99999
    V020
    320       # wind direction (degrees)
    1         # quality code
    N
    0072
    1
    00450     # sky ceiling height (meters)
    1         # quality code
    C
    N
    010000    # visibility distance (meters)
    1         # quality code
    N
    9
    -0128     # air temperature (degrees Celsius x 10)
    1         # quality code
    -0139     # dew point temperature (degrees Celsius x 10)
    1         # quality code
    10268     # atmospheric pressure (hectopascals x 10)
    1         # quality code

Source: Tom White, Hadoop: The Definitive Guide

MapReduce programming model
Example: Processing of meteorological data
■ Problem: one wishes to calculate, for each year, the maximum temperature.
■ Classical approach: a Bash/Awk script.

    #!/usr/bin/env bash
    for year in all/*
    do
      echo -ne `basename $year .gz`"\t"
      gunzip -c $year | \
        awk '{ temp = substr($0, 88, 5) + 0;
               q = substr($0, 93, 1);
               if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
             END { print max }'
    done

    % ./max_temperature.sh
    1901  317
    1902  244
    1903  289
    1904  256
    1905  283
    ...

■ Computing time for the data from 1901 to 2000: 42 minutes.
Source: Tom White, Hadoop: The Definitive Guide

MapReduce programming model
Example: Processing of meteorological data
■ MapReduce approach: the developer writes two functions:
  ■ the Mapper, which is responsible for extracting the year and the temperature from a line;
  ■ the Reducer, which is responsible for calculating the maximum temperature.
■ Hadoop is responsible for:
  ■ dividing the input files into pieces;
  ■ instantiating the Mapper on each machine of the cluster and running the instances;
  ■ collecting the results of the Mapper instances;
  ■ instantiating the Reducer on each machine of the cluster and running the instances, giving them as input the data produced by the Mapper instances;
  ■ storing the results of the Reducer instances.
[Figure: the meteorological data is divided into lines, identified by their byte offsets (0, 106, 212, 318, 424, …). Each Mapper extracts the year and the temperature from a line and writes a key-value pair (year, temperature) as output, e.g. (1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78). The intermediate data is grouped by key (the year) and sorted (shuffle and sort). Each Reducer reads a year and all temperatures of that year, calculates the maximum, and writes a key-value pair (year, maximum temperature) as output, e.g. (1949, 111) and (1950, 22).]
■ Processing time for the data from 1901 to 2000 using 10 machines: 6 minutes.
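The slides describe the Mapper and the Reducer only in words. As a sketch, they could be written as two small Python scripts in the style of Hadoop Streaming (which feeds records through stdin/stdout); the byte offsets follow the NCDC record format shown earlier, and the file names mapper.py / reducer.py are illustrative.

    #!/usr/bin/env python3
    # mapper.py: reads NCDC records from stdin, emits "year<TAB>temperature".
    import sys

    for line in sys.stdin:
        year = line[15:19]                  # observation year
        temp = int(line[87:92])             # air temperature (Celsius x 10), signed
        q = line[92]                        # quality code
        if temp != 9999 and q in "01459":   # skip missing or low-quality readings
            print(f"{year}\t{temp}")

    #!/usr/bin/env python3
    # reducer.py: stdin arrives sorted by key (shuffle and sort), so all
    # temperatures of a year are consecutive; emit the maximum per year.
    import sys

    current_year, max_temp = None, None
    for line in sys.stdin:
        year, temp = line.strip().split("\t")
        temp = int(temp)
        if year != current_year:
            if current_year is not None:
                print(f"{current_year}\t{max_temp}")
            current_year, max_temp = year, temp
        else:
            max_temp = max(max_temp, temp)
    if current_year is not None:
        print(f"{current_year}\t{max_temp}")

The same logic can be tested without a cluster: cat input.txt | ./mapper.py | sort | ./reducer.py simulates the map, shuffle-and-sort, and reduce phases.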
Anatomy of a Hadoop cluster
Distributed file system HDFS
■ HDFS design decisions:
  ■ Files are stored as chunks of fixed size (64 MB).
  ■ Reliability through replication: each chunk is replicated across 3+ nodes.
  ■ A single master coordinates access and keeps the metadata: simple centralized management.
  ■ No data caching: little benefit, due to the large datasets and streaming reads.
  ■ Simplify the API: push some of the issues onto the client (e.g., data layout).
[Figure: an HDFS client contacts the HDFS namenode, which keeps the file namespace (e.g. /foo/bar → block 3d2f), and reads and writes data blocks directly on the HDFS datanodes, which store them in their local Linux file systems.]

Anatomy of a Hadoop cluster
Namenode responsibilities
■ Managing the file system namespace:
  ■ holds the file/directory structure, metadata, file-to-block mapping, access permissions, etc.
■ Coordinating file operations:
  ■ directs clients to datanodes for reads and writes;
  ■ no data is moved through the namenode.
■ Maintaining overall health:
  ■ periodic communication with the datanodes;
  ■ block re-replication and rebalancing;
  ■ garbage collection.

Anatomy of a Hadoop 1.x cluster
Putting everything together
■ Per cluster, one master node runs:
  ■ one Namenode (NN): the master node for HDFS (Web UI at http://hostname:50070/);
  ■ one Jobtracker (JT): the MapReduce master node for job submission (Web UI at http://hostname:50030/).
■ Per slave machine:
  ■ one Tasktracker (TT): contains multiple task slots;
  ■ one Datanode (DN): serves HDFS data blocks.
[Figure: the master node runs the jobtracker and the namenode; each slave node runs a tasktracker and a datanode.]

Apache Hadoop ecosystem
The most important projects
■ HDFS (2006) — distributed file system for big data
■ MapReduce (2006) — framework implementing the MapReduce programming model for parallel processing
■ ZooKeeper (2007) — distributed key-value store for the coordination of distributed applications
■ HBase (2008) — NoSQL database on top of HDFS
■ Pig (2008) — programming model for parallel processing that is higher-level than MapReduce
■ Hive (2010) — data warehouse software that gives an SQL-like interface for querying data; queries are compiled into MapReduce jobs
■ Hive metastore — RDBMS for storing schema information and other metadata
■ YARN (2012) — generic cluster manager
■ Impala (2013) — massively parallel processing (MPP) SQL query engine
■ Spark (2014) — analytics engine for large-scale data processing, written in Scala. Offers multiple programming models:
  ■ Scala Resilient Distributed Datasets (RDDs)
  ■ Scala DataFrames
  ■ Python DataFrames (see the PySpark sketch at the end of this section)
  ■ SQL
■ Ozone (2020) — HDFS-compatible object store optimised for billions of small files

Data lake logical architecture
[Figure: data lake logical architecture (diagram not reproduced).]
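As referenced in the Spark entry of the ecosystem list above, here is a minimal PySpark DataFrame sketch that redoes the maximum-temperature example from this section. The input path is hypothetical, and the fixed-width offsets are the 1-based equivalents of those used in the Mapper sketch earlier.

    # Minimal PySpark sketch of the max-temperature job (hypothetical path).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("max-temperature").getOrCreate()

    # Each NCDC record is one line of text; parse it with fixed-width substrings.
    records = spark.read.text("hdfs:///data/ncdc/")    # hypothetical input path
    parsed = records.select(
        F.substring("value", 16, 4).alias("year"),     # 1-based offsets
        F.substring("value", 88, 5).cast("int").alias("temp"),
        F.substring("value", 93, 1).alias("q"))

    result = (parsed
              .where((F.col("temp") != 9999)
                     & F.col("q").isin("0", "1", "4", "5", "9"))
              .groupBy("year")
              .agg(F.max("temp").alias("max_temp")))

    result.orderBy("year").show()

Note how select/where plays the role of the Map phase, while groupBy/agg plays the role of the shuffle-and-sort and Reduce phases.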