AWS Certified Data Analytics Notes
Building a data lake on AWS:
1. Choose the right storage:
 Amazon S3: the central pillar of the AWS data lake offering. It is secure, highly scalable, and offers durable object storage with millisecond latency for any type of data. S3 also integrates storage lifecycle management.
2. Data ingestion: move data into the data lake
 The right ingestion method depends on the volume, variety and velocity of the data
 For a dedicated link between the source data center and the data lake: use AWS Direct Connect (data collection service)
 Moving large amounts of data (terabytes): use AWS Snowball
 Multiple terabytes of data: use the AWS CLI over an internet connection
 Petabytes or exabytes: AWS Snowmobile
 Moving data from a SQL database engine: AWS Glue and AWS Database Migration Service
3. Cleanse, prep and catalog the data
 Use AWS Glue to schedule ETL jobs
 Use Athena, EMR and Redshift to create and reuse the data catalog, together with AWS Lake Formation
4. Secure the data and metadata
Use IAM user policies for access control and KMS to enable client- and server-side encryption
5. Make data available for analytics
Use Lake Formation to build the data lake
Data Collection
The collection phase of the data analytics pipeline focuses on ingesting raw data from transactions,
logs, and Internet of Things (IoT) devices. A good data analysis solution allows developers to ingest a
wide variety of data - structured, semistructured, and unstructured - at any speed, from batch to
streaming.
This domain consists of three subdomains:
1.1: Determine the operational characteristics of the collection system
1.2: Select a collection system that handles the frequency, volume, and source of data
1.3: Select a collection system that addresses the key properties of data, such as order, format, and
compression
Amazon Kinesis Data Streams ingests and stores data streams for processing
*Streaming Data: Also known as event stream processing, streaming data is the
continuous flow of data generated by various sources.  You can use Amazon
Kinesis Data Streams to collect and process large streams of data records in real
time.
Kinesis Producers: producers send data to Kinesis with a PutRecord call.
Kinesis SDK: the PutRecord / PutRecords APIs send one or many records; PutRecords batches records, which increases throughput (fewer HTTP requests). See the sketch after this list.
USE CASE: LOW THROUGHPUT, HIGHER LATENCY, SIMPLE API, AWS LAMBDA
 Kinesis Producer Library (KPL): an easy-to-use, highly configurable C++/Java library for building high-performance, long-running producers; it sends data synchronously or asynchronously.
Batching: aggregates records into one by introducing some delay with RecordMaxBufferedTime (default 100 ms).
If an application cannot tolerate this additional delay, use the AWS SDK directly instead of the KPL.
USE CASE: WRITES TO ONE OR MORE KINESIS DATA STREAMS, AGGREGATES USER RECORDS TO INCREASE PAYLOAD SIZE AND IMPROVE THROUGHPUT, COLLECTS RECORDS AND USES PUTRECORDS TO WRITE MULTIPLE RECORDS TO MULTIPLE SHARDS PER REQUEST
 AWS Kinesis API: throws ProvisionedThroughputExceededException when limits are exceeded
 Kinesis Agent: monitors log files and sends them to Kinesis Data Streams; it is a Java-based agent built on top of the KPL
 Apache Spark (3rd-party library)
 Kafka (3rd-party library)
 Managed AWS sources for Kinesis Data Streams: CloudWatch Logs, AWS IoT, Kinesis Data Analytics
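A minimal boto3 sketch of the plain-SDK producer path above; the stream name "sensor-stream" and the payloads are made-up placeholders, not from the notes:

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Batch several records into a single PutRecords call (fewer HTTP requests).
records = [
    {
        "Data": json.dumps({"sensor_id": i, "temp": 20 + i}).encode("utf-8"),
        "PartitionKey": f"sensor-{i}",   # determines which shard receives the record
    }
    for i in range(10)
]

response = kinesis.put_records(StreamName="sensor-stream", Records=records)

# Throttled records are reported here and should be retried by the caller.
print("Failed records:", response["FailedRecordCount"])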

Kinesis Consumers:

 Kinesis SDK: consumers poll records from a shard using the GetRecords API call (see the sketch after this list). GetRecords returns up to 10 MB of data or up to 10,000 records, with a limit of 5 calls per shard per second (roughly 200 ms latency).
 Kinesis Client Library (KCL): reads records with de-aggregation, lets multiple consumers in one group share the work (shard discovery), and leverages DynamoDB for coordination and checkpointing
 Kinesis Connector Library: used to write data to S3, DynamoDB, Redshift and the Elasticsearch service; it must run on EC2 (Kinesis Firehose largely replaces it)
 3rd-party libraries
 Kinesis Firehose
 AWS Lambda: can source records from Kinesis Data Streams and has a library to de-aggregate records produced with the KPL. It can be used to run lightweight ETL to anywhere you want, or to trigger notifications / send emails in real time
 Kinesis Enhanced Fan-Out: used when you have multiple consumer applications for the same stream and a low-latency requirement. Each consumer gets 2 MB/s of provisioned throughput per shard; data is pushed to consumers over HTTP/2 with about 70 ms latency; default limit of 5 consumers per stream
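A minimal polling-consumer sketch with the plain SDK, again assuming the made-up "sensor-stream"; production consumers would normally use the KCL instead:

import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream_name = "sensor-stream"   # placeholder

# Take the first shard for illustration; a real consumer iterates over all shards.
shard_id = kinesis.describe_stream(StreamName=stream_name)["StreamDescription"]["Shards"][0]["ShardId"]

# TRIM_HORIZON starts from the oldest available record; LATEST would skip history.
iterator = kinesis.get_shard_iterator(
    StreamName=stream_name,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:                       # loops as long as the shard stays open
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in resp["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = resp.get("NextShardIterator")
    time.sleep(0.2)                   # stay under 5 GetRecords calls/shard/second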
Scaling Kinesis:
 Shard splitting: increase stream capacity by adding shards (split a "hot" shard). One shard supports 1 MB/s of writes (1,000 records per second) and 2 MB/s of reads
 Merging shards: decreases stream capacity and saves costs; used when two shards have low traffic
 Resharding: you can read from the child shards, but data you haven't read yet may still be in the parent; after a reshard, read entirely from the parent until it has no new records, then move to the children
 Autoscaling: not a native Kinesis feature; it can be implemented with AWS Lambda (for example via the UpdateShardCount API, sketched below)
Resharding cannot be done in parallel, so scaling Kinesis is not instantaneous; it takes time.
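A hedged boto3 sketch of resharding through UpdateShardCount (uniform scaling), again assuming a stream called "sensor-stream":

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Changes capacity by uniformly splitting/merging shards; the operation is
# asynchronous and the stream stays in UPDATING state while it completes.
kinesis.update_shard_count(
    StreamName="sensor-stream",        # placeholder stream name
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)

status = kinesis.describe_stream_summary(StreamName="sensor-stream")
print(status["StreamDescriptionSummary"]["StreamStatus"])   # e.g. UPDATING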
Handling Duplicates
 Producers can create duplicates due to network timeouts: a PutRecord is retried, and although the two records have identical data, they get unique sequence numbers. Fix: embed a unique record ID in the data and de-duplicate on the consumer side (see the sketch after this list).
 Consumer retries can make your application read the same data twice when record processors restart:
- A worker terminates unexpectedly
- Worker instances are added or removed
- Shards are merged or split
- The application is deployed
Fix: make the consumer application idempotent; if the final destination can handle duplicates, it's recommended to handle them there.
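A minimal sketch of the unique-record-ID / idempotent-consumer idea above; the record_id field name and the in-memory "seen" set are illustrative only (a real consumer would persist them, e.g. in DynamoDB):

import json
import uuid

# Producer side: embed a unique record ID so retries can be detected downstream.
def build_record(payload: dict) -> bytes:
    payload["record_id"] = str(uuid.uuid4())   # hypothetical field name
    return json.dumps(payload).encode("utf-8")

# Consumer side: process each record ID at most once (idempotent consumer).
seen_ids = set()   # in-memory for illustration; persist this in practice

def process(data: bytes) -> None:
    payload = json.loads(data)
    if payload["record_id"] in seen_ids:
        return                      # duplicate from a producer retry or re-read
    seen_ids.add(payload["record_id"])
    # ... real processing goes here ...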

Kinesis Security:
 Control access / authorization using IAM policies
 Encryption in flight using HTTPS endpoints
 Encryption at rest using KMS
 Client-side encryption
 VPC endpoints
AWS Kinesis Data Firehose (KDF): loads data into AWS data stores; used to deliver streaming data to destinations (S3, Redshift, Elasticsearch, or Splunk).
Fully managed service that requires no administration; near real time (minimum 60 s latency for non-full batches). Used to load data into Redshift / Amazon S3 / Elasticsearch / Splunk, with automatic scaling. Supports data conversion from JSON to Parquet/ORC (for S3) and data transformation through AWS Lambda. You pay only for the amount of data going through Firehose.
Spark and the KCL do not read from KDF.
Firehose buffer sizing: Firehose accumulates records in a buffer, which is flushed based on time and size rules (you define the buffer size and buffer interval). With high throughput the size limit is hit first; with low throughput the time limit is hit first (minimum 60 seconds).
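A hedged boto3 sketch showing where the buffer size/interval rules go when creating an S3 delivery stream; the role ARN, bucket and stream names are placeholders:

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Flush whichever comes first: 5 MB of data or 60 seconds.
firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-analytics-bucket",
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
        "CompressionFormat": "GZIP",
    },
)

# Producers then write with PutRecord / PutRecordBatch:
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": b'{"page": "/home"}\n'},
)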
CloudWatch Logs: real-time streaming logs; they can be streamed into three destinations (KDS, KDF, Lambda) using CloudWatch subscription filters, configurable with the AWS CLI.
SQS is a queue: producers send messages to the queue and consumers poll messages from it. Default message retention is 4 days, with a maximum of 14 days.
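A minimal boto3 producer/consumer sketch for SQS; the queue URL is a placeholder:

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

# Producer: send a message to the queue.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": 42}')

# Consumer: long-poll, process, then delete (otherwise the message reappears
# after the visibility timeout).
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    print(msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])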
AWS IoT: a managed platform that enables you to connect IoT devices to various AWS services and other devices in a secure manner. It's not a single service but rather a complete ecosystem to build, monitor and analyze an IoT platform: the need is to connect the devices and to collect, store and analyze the data they generate.
The three biggest challenges in the IoT world:
 Connectivity to devices, data collection, and performing actions on the devices
 Controlling, managing, and securing large fleets of devices
 Extracting insights from the data being generated by the devices
Device software:
 Amazon FreeRTOS (an IoT operating system for microcontrollers)
 AWS IoT Greengrass, which extends AWS to edge devices so that they can act locally on the data they generate while still using the cloud for management, analytics, and storage
The control services include:
 AWS IoT Core, which is responsible for securing device connectivity and messaging
 AWS IoT Device Management, to onboard fleets of devices and provide management and software updates to the fleet
 AWS IoT Device Defender, which is responsible for fleet audit and protection
 AWS IoT Things Graph, which is responsible for connecting devices and web services
IoT Core:
The Device Gateway (with a message broker) lets devices communicate with the AWS Cloud.
The Rules Engine contains a set of rules you define to modify the behavior of your devices; it can also define actions to send data to many different targets within AWS.
The Device Shadow keeps a shadow of the device's state, used when the device loses internet connectivity.
IoT Device Gateway: the entry point for IoT devices connecting to AWS; it allows devices to securely and efficiently communicate with AWS IoT and supports the MQTT, WebSockets and HTTP 1.1 protocols. It's fully managed and scales automatically to support over a billion devices.
IoT Message Broker: pub/sub (publisher/subscriber) messaging pattern; devices can communicate with one another over MQTT, WebSockets and HTTP 1.1. It sends messages to all clients subscribed to a topic.
IoT Thing Registry: all connected IoT devices are represented in the AWS IoT registry, which organizes the resources associated with each device in the AWS Cloud. Each device gets a unique ID and supports metadata.
DMS: quickly and securely migrate databases from on premises to AWS; it's resilient and self-healing. Used for homogeneous and heterogeneous migrations, with continuous data replication using CDC.
Usually the Schema Conversion Tool (SCT) is needed, but not when migrating between the same DB engine.
Direct Connect (DX): provides a dedicated private connection from a remote network to your VPC; you need to set up a virtual private gateway on the VPC. Use a Direct Connect gateway to reach multiple VPCs.
AWS Snow Family: for transfers that would take more than a week over the network (offline devices that perform data migrations); data is moved to AWS over a physical route rather than the network. Use OpsHub to manage Snow Family devices.
Managed Streaming for Apache Kafka (MSK): fully managed Apache Kafka on AWS. You can create custom configurations for your clusters; the default message size is 1 MB, but it's possible to send larger messages (e.g. 10 MB) into Kafka with a custom configuration. Data is stored on EBS volumes.
Producers (your code) write data to a Kafka topic and consumers poll from the topic.
Security in Kafka:
- Encryption in flight using TLS between the brokers and between the clients and brokers; encryption at rest for your EBS volumes using KMS
- Network security: authorize specific security groups for your Kafka clients
- Authentication & authorization: define who can read/write to which topic | IAM access control
Monitoring MSK:
- CloudWatch metrics
- Prometheus
- Broker log delivery
MSK vs Kinesis
Data Storage
AWS S3: buckets must have a globally unique name and are defined at the region level. Objects (files) have a key (the full path); the max object size is 5 TB, and uploads larger than 5 GB must use multi-part upload (see the sketch below). You can use tags (useful for security and lifecycle rules) and versioning. All operations are strongly consistent.
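A hedged boto3 sketch of a large upload where the transfer manager switches to multi-part automatically above a size threshold; the bucket, key and local path are placeholders:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Force multi-part for anything above 100 MB and upload parts in parallel.
config = TransferConfig(multipart_threshold=100 * 1024 * 1024, max_concurrency=8)

s3.upload_file(
    Filename="/data/events-2023.parquet",      # placeholder local path
    Bucket="my-analytics-bucket",              # placeholder bucket
    Key="raw/events/events-2023.parquet",      # the object key is the full path
    Config=config,
)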
S3 replication: CRR (Cross-Region Replication) / SRR (Same-Region Replication)
- Must enable versioning in source and destination
- Copying is asynchronous
- Must give proper IAM permissions to S3
- After activating replication, only new objects are replicated (not retroactive)
Encryption:
- SSE-S3: server-side encryption (SSE) using keys handled & managed by Amazon S3; must set a header
- SSE-KMS: encryption using keys handled & managed by KMS
- SSE-C: encryption using data keys fully managed by the customer outside of AWS (S3 does not store the encryption key you provide)
- Client-side encryption
S3 Select & Glacier Select: an API that lets you
- Retrieve less data using SQL by performing server-side filtering
- Filter by rows & columns (simple SQL statements)
- Reduce network transfer and client-side CPU cost
Glacier Select can only work on uncompressed CSV files.
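A hedged boto3 sketch of an S3 Select call that filters a CSV object server-side; the bucket, key and column names are made up:

import boto3

s3 = boto3.client("s3")

# Only matching rows cross the network; the SQL runs inside S3.
resp = s3.select_object_content(
    Bucket="my-analytics-bucket",                      # placeholder
    Key="raw/sales/2023.csv",                          # placeholder
    ExpressionType="SQL",
    Expression="SELECT s.region, s.amount FROM s3object s WHERE s.amount > '100'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))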
S3 Event Notifications: you can create event notification rules that target:
- SNS
- SQS
- Lambda functions
DynamoDB: a good place to store data that needs to be written very often.
 Fully managed database, highly available with replication across 3 AZs
 NoSQL database (not a relational database)
 Scales to massive workloads; distributed database
 Millions of requests per second, trillions of rows, 100s of TB of storage
 Fast and consistent performance (low latency on retrieval)
 Integrated with IAM for security, authorization and administration
 Low cost and auto-scaling capabilities
Basics: DynamoDB is made of tables; each table has a primary key (must be decided at creation time); each table can have an infinite number of items (rows); each item has attributes (which can be added over time and can be null); the max size of an item is 400 KB.
Primary keys: either a partition key alone, or a partition key + sort key (composite key).
Provisioned throughput: you provision read (RCU, read capacity units) and write (WCU, write capacity units) capacity. See the worked sketch after this list.
- WCU (write capacity units): one WCU represents one write per second for an item up to 1 KB in size. If items are larger than 1 KB, more WCUs are consumed (10 objects per second of 2 KB each need 2*10 = 20 WCU; 6 objects per second of 4.5 KB each need 6*5 = 30 WCU, since 4.5 KB is rounded up to the next whole KB; 120 objects per minute of 2 KB need 120/60 * 2 = 4 WCU).
- RCU (read capacity units): there are two read types, Eventually Consistent Reads (reading just after a write may return stale data because of replication) and Strongly Consistent Reads (reading just after a write returns the correct data). By default DynamoDB is a distributed database and uses Eventually Consistent Reads. One RCU represents one strongly consistent read per second, or two eventually consistent reads per second, for an item up to 4 KB in size.
- Throttling: if you exceed your RCU or WCU you get ProvisionedThroughputExceededException, typically because of hot keys/partitions (one partition being read too many times). Solutions: 1) use exponential backoff (built into the SDK), 2) distribute partition keys as much as possible, 3) use DynamoDB Accelerator (DAX).
- Partitions internals: you start with one partition; each partition has a max of 3,000 RCU / 1,000 WCU and a max of 10 GB; WCU and RCU are spread evenly across partitions.
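A small worked sketch of the capacity arithmetic above (plain Python, no AWS calls):

import math

def wcu(items_per_second: float, item_size_kb: float) -> float:
    # 1 WCU = one 1 KB write per second; the item size is rounded up to the next KB.
    return items_per_second * math.ceil(item_size_kb)

def rcu(items_per_second: float, item_size_kb: float, strongly_consistent: bool = True) -> float:
    # 1 RCU = one strongly consistent 4 KB read/s, or two eventually consistent ones.
    units = items_per_second * math.ceil(item_size_kb / 4)
    return units if strongly_consistent else units / 2

print(wcu(10, 2))        # 20 WCU  (10 x 2 KB items per second)
print(wcu(6, 4.5))       # 30 WCU  (4.5 KB rounds up to 5 KB)
print(wcu(120 / 60, 2))  # 4 WCU   (120 x 2 KB items per minute)
print(rcu(10, 8))        # 20 RCU  (10 strongly consistent 8 KB reads per second)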
 DynamoDB APIs:
- Writing data: PutItem | UpdateItem | conditional writes (see the sketch at the end of this DynamoDB section)
 Indexes: LSI and GSI
- LSI (Local Secondary Index): must be defined at table creation time; it's an alternate sort key for the table, local to the hash key; up to five local secondary indexes per table
- GSI (Global Secondary Index): used to speed up queries on non-key attributes; GSI = partition key + optional sort key
 DAX (DynamoDB Accelerator): a cache for DynamoDB; it doesn't scale automatically, you have to provision it in advance
 All changes in DynamoDB (create, update, delete) can end up in a DynamoDB Stream; the stream can be read by AWS Lambda, and you can implement cross-region replication using streams (a stream retains data for 24 h) with a configurable batch size (up to 1,000 rows, 6 MB). You can use the Kinesis Adapter (with the KCL library) to consume from DynamoDB Streams.




DynamoDB TTL (time to live): automatically deletes an item after an expiry date/time, for cases where you want data to expire based on a date. It's a background task operated by the DynamoDB service itself and helps reduce storage and manage table size over time.
Storing large objects (max item size in DynamoDB = 400 KB): store them in S3 and keep a reference in DynamoDB. Also, for objects under 400 KB that are not accessed very often, you can lower costs by storing them in S3 with a reference in DynamoDB.
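A hedged boto3 sketch of PutItem and a conditional write, as listed under the DynamoDB APIs above; the "Users" table and its attributes are made up:

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Users")   # placeholder table with partition key "user_id"

# Plain write: overwrites any existing item with the same key.
table.put_item(Item={"user_id": "u-123", "name": "Ada", "visits": 1})

# Conditional write: only succeeds if the item does not already exist,
# which makes retries safe (idempotent insert).
try:
    table.put_item(
        Item={"user_id": "u-123", "name": "Ada", "visits": 1},
        ConditionExpression="attribute_not_exists(user_id)",
    )
except ClientError as err:
    if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
        print("Item already exists, skipping")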
AWS ElastiCache: (real-time analytics service) managed Redis or Memcached (in-memory databases with high performance and low latency) that helps reduce the load on databases.
- Redis: in-memory key-value store with very low latency
- Memcached: in-memory object store; the cache doesn't survive reboots
Process

Lambda runs code snippets in the cloud and scales automatically. It's often used to process data as it moves from one service to another: real-time file/stream processing, ETL, processing AWS events, and cron replacement (time-based triggers). It supports many languages. You can integrate Lambda with different services and AWS APIs, and set up IAM roles to give it access to those services.
 Between S3 and ElastiCache you can set up a trigger using Lambda: once S3 receives data, Lambda can process and analyze that data and send it to ElastiCache.
Lambda is not a good fit for building a dynamic website (use EC2 or CloudFront instead).
Although you think of a trigger as "pushing" events, Lambda actually polls your Kinesis streams for new activity (see the handler sketch below).
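A minimal handler sketch for a Kinesis-triggered Lambda; Kinesis payloads arrive base64-encoded in event["Records"], and what you do with each record is up to you (this one just logs):

import base64
import json

def lambda_handler(event, context):
    # Each batch of Kinesis records is delivered in event["Records"];
    # the payload is base64-encoded bytes.
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        doc = json.loads(payload)
        # Lightweight "ETL" step: here we only log the record.
        print(record["kinesis"]["partitionKey"], doc)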

Glue: a serverless system that handles discovery and definition of table definitions & schemas; it is used as the central metadata repository for your data lake. The purpose of Glue is to extract structure from your unstructured data by inferring a schema, so that you can query the data with SQL or any SQL-based tool. It also lets you run custom ETL jobs (processing); it uses Apache Spark under the hood without you having to manage the Spark infrastructure.
- Glue Crawler / Data Catalog: the crawler scans data in S3 and creates a schema; only the table definition is stored, the original data stays in S3. This lets you query unstructured data as if it were structured; Glue is the glue between unstructured data and structured data analytics tools. It will also extract partitions of your data based on how your S3 data is organized.
- Glue + Hive: Glue can integrate with Hive; you can use the Glue Data Catalog as the metadata store for Hive, and you can import a Hive metastore into Glue and use it.
- Glue ETL: can automatically generate code in Scala or Python; you can modify the generated code or provide your own Spark or PySpark script. It supports encryption, can be event-driven, and you can provision additional "DPUs" (data processing units) to increase the performance of the underlying Spark jobs (use job metrics to estimate how many DPUs you need). Glue ETL lets you process and transform your data through a graphical interface; the target can be S3, JDBC (RDS, Redshift) or the Glue Data Catalog. It's fully managed and cost effective, and the jobs run on a serverless Spark platform.
Glue ETL: a DynamicFrame is a collection of DynamicRecords, each with its own schema (fields, columns); it's similar to a Spark DataFrame and has Scala and Python APIs (see the job-script sketch after this list).
- Glue Scheduler: schedules jobs; Glue Triggers automate runs based on "events"
- AWS Glue Dev Endpoints: allow you to develop code using a notebook; they run in a VPC controlled by security groups
- Running Glue jobs:
- AWS Glue DataBrew: a visual data preparation tool (clean and normalize data faster) used to process large data sets (complicated workflows); it allows you to create recipes of transformations that can be saved as jobs within a larger project. From a security perspective it integrates with KMS, IAM roles, SSL in transit, CloudWatch and CloudTrail
- AWS Glue Elastic Views: used to build materialized views from Aurora, RDS and DynamoDB; it's SQL-based and handles any copying or combining/replication of data needed
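A hedged sketch of a Glue ETL job script using DynamicFrames, as referenced in the Glue ETL bullet above; the database, table, column and bucket names are placeholders:

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a catalog table that a crawler created (placeholder names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# A DynamicFrame is a collection of DynamicRecords, each carrying its own schema.
mapped = ApplyMapping.apply(
    frame=dyf,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "double", "amount", "double")],
)

# Write the result back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-analytics-bucket/curated/orders/"},
    format="parquet",
)
job.commit()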
AWS Lake Formation: built on top of Glue; a system that makes it easy to set up a secure data lake in days, including loading data & monitoring data flows.
There is no cost for Lake Formation itself, but the underlying services incur charges (S3, Glue, EMR, Athena, Redshift). (To dig into further.)
EMR (Elastic MapReduce): a managed Hadoop framework that runs on EC2 instances; it includes Spark, HBase, Presto, Flink, Hive and more. EMR Notebooks let you create and run code.
An EMR cluster is a collection of EC2 instances running Hadoop; each instance is called a node and has a role (type):
- Master node: manages the cluster
- Core node
- Task node