
IST 04 — Data lakes

IST — Data lakes
Infrastructures pour le stockage et le traitement de données | Academic year 2023/24
© 2022 HEIG-VD
HEIG-VD | TIC – Technologies de l’Information et de la Communication
Evolution of data management for analytics
■ 1990’s: Data warehouse
■ Companies make heavy use of relational database management systems
(RDBMS) to support business processes. Tens or hundreds of databases.
■ Online transaction processing (OLTP)
■ Data warehouse: Central repository of key data ingested from OLTP
systems. Used for analytic reports (“Which sales channel had the biggest
decline in the last quarter?”). Analytic RDBMS.
■ Online analytical processing (OLAP)
■ Data is well-integrated, highly structured, highly curated, and highly
trusted.
■ 2006: Hadoop and Big Data
■ Exponential growth in semi-structured and unstructured data (mobile and
web applications, sensors, IoT devices, social media, and audio and video
media platforms).
■ Data has high velocity, high volume, high variety
■ Hadoop: open source framework for processing large datasets on clusters
of computers
Evolution of data management for analytics
■ 2011: Data lake
■ Companies have lots of data, suspect that it may be valuable, but don’t
know yet how to extract the value. → Store everything, just in case.
■ Data lake: Company-wide repository for storing data. Store it in raw form.
Don’t try to optimise the storage.
■ Adoption of public cloud infrastructure, namely cloud object stores.
■ Data is unstructured, semi-structured, and structured.
■ 2020: Data lakehouse
■ Companies try to integrate the best ideas from data warehouses and data
lakes.
■ In addition to raw format, data is stored in optimised binary format such as
Parquet.
■ Concurrent reads/writes become possible with table formats such as Delta Lake.
■ 2021: Data mesh
■ The company-wide data lake becomes too big and too complex to
manage. Instead, each division of the company creates its own data lake
that it manages independently.
Enterprise data warehousing architecture
Dimensional modeling in data warehouses
[Figure: two dimensional models — a star schema, in which the dimension tables are denormalised, and a snowflake schema, in which the dimension tables are normalised, with less duplication.]
Extract-Transform-Load pipelines
Feeding data into the warehouse
Google File System and MapReduce programming model
■ In the beginning of the 2000’s, Google develops its search engine, which quickly dominates the market thanks to superior technology.
■ In 2003 Google engineers publish a paper about the Google File System, the distributed file system that is used to store the data for its search engine.
■ In 2004 they follow with a paper on MapReduce, a new programming model for parallel processing that Google uses for the search engine processing (indexing, ranking, …).
■ These papers were very influential and launched the era of Big Data.
[Figure 1 of the GFS paper: GFS architecture — an application uses the GFS client, which asks the GFS master (holding the file namespace, e.g. /foo/bar → chunk 2ef0) for chunk handles and chunk locations, then exchanges chunk data directly with the GFS chunkservers, which store the chunks on the Linux file system.]
[Figure of the MapReduce paper: execution overview — the user program forks a master and workers; the master assigns map and reduce tasks; map workers read the input splits and write intermediate files to their local disks; reduce workers read these files remotely and write the output files.]
Google File System and MapReduce programming model
■ The Google File System and MapReduce programming model had several innovations over existing high-performance computing systems:
■ The system is built on inexpensive commodity hardware. It scales to hundreds or thousands of machines.
■ The programming model is much simpler than the shared memory or message passing models.
[Figure: processes P1–P5 sharing memory (pthreads) versus processes P1–P5 exchanging messages (message passing).]
Apache Hadoop
■ In 2005 engineers at Yahoo!, Doug Cutting and
Mike Cafarella, develop a system similar to
Google’s.
■ In 2006 they launch the Apache Hadoop open
source project.
■ A Hadoop installation consists mainly of
■ a cluster of machines (physical or virtual)
■ the distributed file system HDFS (Hadoop Distributed File System)
■ the NoSQL database HBase
■ the distributed computing framework MapReduce
■ the data processing applications written by the developer.
[Figure: the Hadoop stack — data analysis applications on top of the MapReduce framework and the HBase database, both on top of the Hadoop Distributed File System (HDFS), running on a cluster of machines.]
■ Hadoop is the name of the toy elephant of Cutting’s son.
MapReduce programming model
Distribution of data in HDFS
■ When uploading a big file to a MapReduce cluster, the file is distributed over the machines of the cluster.
■ The file system takes care of dividing the file into pieces (chunks of 64 MB) which are managed
by different machines of the cluster.
■ This is a form of sharding.
[Figure: a big file is divided into pieces, and the pieces are distributed over the machines of the cluster (HDFS nodes 1–4).]
■ (Additionally, the chunks are replicated: by default there are three copies of each chunk in the cluster.)
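The chunking and replica placement described above can be sketched in plain Python. This is only an illustration: the 64 MB chunk size and the three-fold replication are the HDFS defaults, the node names are invented, and real HDFS uses rack-aware placement rather than the round-robin shown here.

```python
# Sketch: how a big file is split into fixed-size chunks and how each
# chunk is assigned to three distinct nodes of the cluster.
CHUNK_SIZE = 64 * 1024 * 1024          # HDFS default chunk (block) size
REPLICATION = 3                        # default replication factor

def split_into_chunks(file_size: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of chunks needed to store a file of the given size."""
    return -(-file_size // chunk_size)  # ceiling division

def place_replicas(num_chunks: int, nodes: list[str]) -> dict[int, list[str]]:
    """Assign each chunk to REPLICATION distinct nodes (round-robin)."""
    placement = {}
    for i in range(num_chunks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(REPLICATION)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
num_chunks = split_into_chunks(1 * 1024**3)   # a 1 GiB file
placement = place_replicas(num_chunks, nodes)
print(num_chunks)                             # 16 chunks of 64 MB
print(placement[0])                           # ['node1', 'node2', 'node3']
```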
MapReduce programming model
Data processing — Main concept
■ One wishes to process a big volume of data which is distributed over several machines.
■ Traditional approach: move the data to the processing
[Figure: the data stored on Nodes 1–4 is moved over the network to a central processing machine, which produces the result.]
■ Problem:
■ Data volumes keep growing faster than the performance of data storage.
■ Hard disks have a relatively low read speed (currently ~100 MB/second for magnetic disks, ~550 MB/second for SSDs)
■ Reading a copy of the Web (> 400 TB) would need more than a week (SSD)!
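The week-long estimate can be checked with a quick back-of-the-envelope calculation, using the sizes and speeds quoted above:

```python
# Time to read 400 TB sequentially from a single drive.
web_copy_bytes = 400e12        # > 400 TB
ssd_speed = 550e6              # ~550 MB/s (SSD)
hdd_speed = 100e6              # ~100 MB/s (magnetic disk)

ssd_days = web_copy_bytes / ssd_speed / 86400
hdd_days = web_copy_bytes / hdd_speed / 86400
print(f"SSD: {ssd_days:.1f} days, HDD: {hdd_days:.1f} days")
# SSD: 8.4 days, HDD: 46.3 days
```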
MapReduce programming model
Data processing — Main concept
■ MapReduce approach: move the processing to the data
■ Each machine which stores data executes a piece of the processing.
■ Partial results are collected and aggregated.
[Figure: Nodes 1–4 each compute a partial result locally; the partial results are aggregated into the final result.]
■ Advantages
■ Less movement of data on the network.
■ Processing takes place in parallel on several machines.
■ Process a copy of the Web using 1'000 machines: < 3 hours
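The < 3 hours figure follows from spreading the same read over 1'000 machines, assuming magnetic disks at ~100 MB/s and a perfectly even split of the work:

```python
# Parallel read: 400 TB spread over 1'000 machines reading at once.
web_copy_bytes = 400e12        # > 400 TB
disk_speed = 100e6             # ~100 MB/s per magnetic disk
machines = 1000

hours = web_copy_bytes / (machines * disk_speed) / 3600
print(f"{hours:.1f} hours")    # ≈ 1.1 hours, well under 3
```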
MapReduce programming model
Distributed computing platform
■ The MapReduce concept is a simple data processing model that can be applied to many
problems:
■ Google: compute PageRank which determines the importance of a Web page.
■ Last.fm: compute charts of the most listened songs and recommendations (music you might like).
■ Facebook: compute usage statistics (user growth, visited pages, time spent by users) and
recommendations (people you might know, applications you might like).
■ Rackspace: indexing of infrastructure logs to determine root cause in case of failures.
■ ...
■ To implement the model one needs to
■ parallelize the compute tasks
■ balance the load
■ optimize disk and memory transfers
■ manage the case of a failing machine
■ ...
■ A distributed computing platform is needed!
MapReduce programming model
Map and Reduce functions — Origin of the terms
■ The terms Map and Reduce come from Lisp.
■ When you have a list, you can apply the same function to every element of the list at once. You obtain another list.
■ For example the function x → x²: the input list (1, 2, 3, 4, 5, 6, 7, 8) is mapped to the output list (1, 4, 9, 16, 25, 36, 49, 64).
■ You can also apply at once a function which reduces the elements of a list to a single value.
■ For example the sum function: it reduces the input list (1, 2, 3, 4, 5, 6, 7, 8) to the output value 36.
■ In Hadoop, the functions Map and Reduce are more general.
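Python inherits the same two operations from the functional tradition, as the built-in map and functools.reduce; the Lisp-style example above translates directly:

```python
from functools import reduce

input_list = [1, 2, 3, 4, 5, 6, 7, 8]

# Map: apply x -> x^2 to every element, obtaining another list.
squares = list(map(lambda x: x * x, input_list))
print(squares)     # [1, 4, 9, 16, 25, 36, 49, 64]

# Reduce: combine the elements of the list into a single value with +.
total = reduce(lambda a, b: a + b, input_list)
print(total)       # 36
```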
MapReduce programming model
Example: Processing of meteorological data
■ The National Climatic Data Center of the
United States publishes meteorological data
■ Captured by tens of thousands of meteorological stations
■ Measures: temperature, humidity, precipitation, wind, visibility, pressure, etc.
■ Historical data available since the beginning of
meteorological measurements
■ The data is available as text files.
■ Example file:
0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
...
MapReduce programming model
Example: Processing of meteorological data
■ Each file contains the measures of a year.
■ One line represents a set of observations of a station at a certain point in time.
■ Example line with comments (distributed over several lines for better legibility):
0057
332130   # USAF weather station identifier
99999    # WBAN weather station identifier
19500101 # observation date
0300     # observation time
4
+51317   # latitude (degrees x 1000)
+028783  # longitude (degrees x 1000)
FM-12
+0171    # elevation (meters)
99999
V020
320      # wind direction (degrees)
1        # quality code
N
0072
1
00450    # sky ceiling height (meters)
1        # quality code
C
N
010000   # visibility distance (meters)
1        # quality code
N
9
-0128    # air temperature (degrees Celsius x 10)
1        # quality code
-0139    # dew point temperature (degrees Celsius x 10)
1        # quality code
10268    # atmospheric pressure (hectopascals x 10)
1        # quality code
Source: Tom White, Hadoop: The Definitive Guide
MapReduce programming model
Example: Processing of meteorological data
■ Problem: One wishes to calculate for each year the maximum temperature
■ Classical approach
■ Bash / Awk script
#!/usr/bin/env bash
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0; q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done
% ./max_temperature.sh
1901 317
1902 244
1903 289
1904 256
1905 283
...
■ Computing time for the data from 1901 to 2000: 42 minutes
Source: Tom White, Hadoop: The Definitive Guide
MapReduce programming model
Example: Processing of meteorological data
■ MapReduce approach
■ The developer writes two functions
■ The Mapper, which is responsible for extracting the year and the temperature from a line.
■ The Reducer, which is responsible for calculating the maximum temperature.
■ Hadoop is responsible for
■ dividing the input files into pieces,
■ instantiating the Mapper on each machine of the cluster and running the instances,
■ collecting the results of the Mapper instances,
■ instantiating the Reducer on each machine of the cluster and running the instances, giving them as input the data produced by the Mapper instances,
■ storing the results of the Reducer instances.
[Figure: the meteorological data is divided into lines. Each Mapper extracts the year and the temperature from a line and writes a key-value pair (year, temperature) as output, e.g. (1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78). During the shuffle and sort, the intermediate data is grouped by key (the year) and sorted: (1949, [111, 78]), (1950, [0, 22, -11]). The Reducer reads the year and all temperatures of that year, calculates the maximum, and writes a key-value pair (year, maximum temperature) as output: (1949, 111), (1950, 22).]
■ Processing time for the data from 1901 to 2000 using 10 machines: 6 minutes
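The flow of map, shuffle and sort, and reduce can be simulated in a few lines of plain Python. This is only a sketch: the mapper receives pre-parsed (year, temperature) pairs instead of raw NCDC lines, and the shuffle is a local sort rather than a distributed exchange.

```python
from itertools import groupby

def mapper(record):
    year, temp = record          # in real Hadoop this would parse a raw NCDC line
    yield (year, temp)

def reducer(year, temps):
    yield (year, max(temps))     # maximum temperature of the year

records = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]

# Map phase
intermediate = [pair for r in records for pair in mapper(r)]

# Shuffle and sort: group the intermediate pairs by key (the year).
intermediate.sort(key=lambda kv: kv[0])
grouped = [(year, [t for _, t in group])
           for year, group in groupby(intermediate, key=lambda kv: kv[0])]

# Reduce phase
result = dict(pair for year, temps in grouped for pair in reducer(year, temps))
print(result)                    # {1949: 111, 1950: 22}
```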
Anatomy of a Hadoop cluster
Distributed file system HDFS
■ HDFS design decisions:
■ Files stored as chunks
■ Fixed size (64 MB)
■ Reliability through replication
■ Each chunk replicated across 3+ nodes
■ Single master to coordinate access, keep metadata
■ Simple centralized management
■ No data caching
■ Little benefit due to large datasets, streaming reads
■ Simplify the API
■ Push some of the issues onto the client (e.g., data layout)
[Figure: an application uses the HDFS client, which contacts the HDFS namenode (holding the file namespace, e.g. /foo/bar → block 3d2f); the HDFS datanodes store the blocks on top of the Linux file system.]
Anatomy of a Hadoop cluster
Namenode responsibilities
■ Managing the file system namespace:
■ Holds file/directory structure, metadata, file-to-
block mapping, access permissions, etc.
■ Coordinating file operations:
■ Directs clients to datanodes for reads and writes
■ No data is moved through the namenode
■ Maintaining overall health:
■ Periodic communication with the datanodes
■ Block re-replication and rebalancing
■ Garbage collection
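A minimal sketch of the first two responsibilities — the in-memory namespace with its file-to-block mapping, and directing clients to datanodes — might look like this. The class, block IDs, and datanode names are invented for illustration; the real namenode also tracks permissions, leases, heartbeats, and much more.

```python
# Toy namenode: file namespace plus file-to-block and block-to-datanode maps.
class Namenode:
    def __init__(self):
        self.file_to_blocks = {}    # path -> ordered list of block ids
        self.block_locations = {}   # block id -> datanodes holding a replica

    def create_file(self, path, blocks):
        self.file_to_blocks[path] = list(blocks)

    def add_replica(self, block_id, datanode):
        self.block_locations.setdefault(block_id, []).append(datanode)

    def locate(self, path):
        """Direct a client to datanodes; no file data flows through us."""
        return [(b, self.block_locations.get(b, []))
                for b in self.file_to_blocks[path]]

nn = Namenode()
nn.create_file("/foo/bar", ["3d2f", "7a01"])
for dn in ["datanode1", "datanode2", "datanode3"]:
    nn.add_replica("3d2f", dn)
print(nn.locate("/foo/bar")[0])  # ('3d2f', ['datanode1', 'datanode2', 'datanode3'])
```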
Anatomy of a Hadoop 1.x cluster
Putting everything together
■ Per cluster:
■ One Namenode (NN): master node for HDFS. Web UI at http://hostname:50070/
■ One Jobtracker (JT): master node for job submission. Web UI at http://hostname:50030/
■ Per slave machine:
■ One Tasktracker (TT): contains multiple task slots
■ One Datanode (DN): serves HDFS data blocks
[Figure: the master node runs the jobtracker (MapReduce) and the namenode (HDFS); each slave node runs a tasktracker and a datanode.]
Apache Hadoop ecosystem
The most important projects
■ HDFS (2006) — Distributed file system for big data
■ MapReduce (2006) — Framework implementing the MapReduce programming model for parallel processing
■ ZooKeeper (2007) — Distributed key-value store for the coordination of distributed applications
■ HBase (2008) — NoSQL database on top of HDFS
■ Pig (2008) — Programming model for parallel processing that is higher-level than MapReduce
■ Hive (2010) — Software project for data warehouses that gives an SQL-like interface for querying data. Compiles queries to MapReduce jobs.
■ Hive metastore — RDBMS for storing schema information and other metadata
■ YARN (2012) — Generic cluster manager
■ Impala (2013) — Massively parallel processing (MPP) SQL query engine
■ Spark (2014) — Analytics engine for large-scale data processing written in Scala. Offers multiple programming models:
■ Scala Resilient Distributed Datasets (RDDs)
■ Scala DataFrames
■ Python DataFrames
■ SQL
■ Ozone (2020) — HDFS-compatible object store optimised for billions of small files
Data lake logical architecture