AWS Certified Data Analytics Notes

Building a data lake on AWS:
1. Choose the right storage: Amazon S3 is the central pillar of the AWS data lake offering. It is secure, highly scalable, provides durable object storage with millisecond latency, and can store any type of data. S3 also integrates storage lifecycle management.
2. Data ingestion: move data into the data lake. The right tool depends on the volume, variety, and velocity of the data:
- Between a source data center and the data lake: use AWS Direct Connect (dedicated network connection to AWS)
- Large amounts of data (tens of terabytes or more): use AWS Snowball
- A few terabytes of data: use the AWS CLI over an internet connection
- Petabytes or exabytes: use AWS Snowmobile
- Data from a SQL database engine: use AWS Glue and AWS Database Migration Service
3. Cleanse, prep, and catalog the data: use Glue to schedule ETL jobs; use Athena, EMR, and Redshift to create and reuse a data catalog, together with AWS Lake Formation.
4. Secure the data and metadata: use IAM user policies and KMS to enable client- and server-side encryption.
5. Make the data available for analytics, using Lake Formation to build and govern the data lake.

Data Collection
The collection phase of the data analytics pipeline focuses on ingesting raw data from transactions, logs, and Internet of Things (IoT) devices. A good data analysis solution allows developers to ingest a wide variety of data - structured, semistructured, and unstructured - at any speed, from batch to streaming. This domain consists of three subdomains:
1.1: Determine the operational characteristics of the collection system
1.2: Select a collection system that handles the frequency, volume, and source of data
1.3: Select a collection system that addresses the key properties of data, such as order, format, and compression

Amazon Kinesis Data Streams: ingests and stores data streams for processing.
*Streaming data: also known as event stream processing, streaming data is the continuous flow of data generated by various sources. You can use Amazon Kinesis Data Streams to collect and process large streams of data records in real time.

Kinesis Producers: producers send data to Kinesis with a PutRecord call.
- Kinesis SDK: PutRecord sends a single record; the PutRecords API sends one or many records in a single call (batching increases throughput and reduces the number of HTTP requests). USE CASE: LOW THROUGHPUT, HIGHER LATENCY, SIMPLE API, AWS LAMBDA. (See the producer sketch below.)
- Kinesis Producer Library (KPL): easy-to-use, highly configurable C++/Java library used for building high-performance, long-running producers. Sends data synchronously or asynchronously. Batching: aggregates records into one record by introducing some delay with RecordMaxBufferedTime (default 100 ms). If an application cannot tolerate this additional delay, use the AWS SDK directly instead of the KPL. USE CASE: WRITES TO ONE OR MORE KINESIS DATA STREAMS, AGGREGATES USER RECORDS TO INCREASE PAYLOAD SIZE AND IMPROVE THROUGHPUT, COLLECTS RECORDS AND USES PUTRECORDS TO WRITE MULTIPLE RECORDS TO MULTIPLE SHARDS PER REQUEST.
- AWS Kinesis API: throws ProvisionedThroughputExceededException when the stream's provisioned throughput is exceeded.
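A minimal sketch of the plain-SDK producer path described above (boto3 PutRecords); the stream name and payloads are hypothetical:

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Batch several records into one PutRecords call: fewer HTTP requests
# and higher throughput than calling PutRecord once per record.
records = [
    {
        "Data": json.dumps({"sensor_id": i, "temp_c": 20 + i}).encode(),
        "PartitionKey": f"sensor-{i}",  # determines which shard receives the record
    }
    for i in range(10)
]

response = kinesis.put_records(StreamName="my-data-stream", Records=records)

# Throttled records (ProvisionedThroughputExceededException) show up here
# and should be retried by the producer.
print("Failed records:", response["FailedRecordCount"])
```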
Other producers and sources:
- Kinesis Agent: monitors log files and sends them to Kinesis Data Streams. It is a Java-based agent built on top of the KPL.
- Apache Spark (3rd-party library), Kafka (3rd-party library)
- Managed AWS sources for Kinesis Data Streams: CloudWatch Logs, AWS IoT, Kinesis Data Analytics

Kinesis Consumers:
- Kinesis SDK: records are polled by the consumer from a shard using the GetRecords API call. GetRecords returns up to 10 MB of data or up to 10,000 records, with 5 calls per shard per second, i.e. about 200 ms latency. (See the consumer sketch at the end of this Kinesis section.)
- Kinesis Client Library (KCL): reads records with de-aggregation, shares the shards of a stream across multiple consumers in one consumer group (shard discovery), and leverages DynamoDB for coordination and checkpointing.
- Kinesis Connector Library: used to write data to S3, DynamoDB, Redshift, and the Elasticsearch Service; it must run on EC2 (Kinesis Data Firehose largely replaces it).
- 3rd-party libraries
- Kinesis Data Firehose
- AWS Lambda: can source records from Kinesis Data Streams and has a library to de-aggregate records produced with the KPL. It can be used to run lightweight ETL to any destination you want, or to trigger notifications / send emails in real time.
- Kinesis Enhanced Fan-Out: used when you have multiple consumer applications for the same stream and a low-latency requirement. Each consumer gets 2 MB/s of provisioned throughput per shard; data is pushed to consumers over HTTP/2, reducing latency to about 70 ms; default limit of 5 consumers per stream.

Scaling Kinesis (by adding shards):
- Shard splitting (splitting a "hot" shard): one shard supports 1 MB/s of writes (1,000 records per second) and 2 MB/s of reads.
- Merging shards decreases the stream capacity and saves costs; used when two shards have low traffic.
- Resharding: you can read data from the child shards, however data you haven't read yet could still be in the parent; after a reshard, read entirely from the parent until it has no new records.
- Autoscaling is not a native feature of Kinesis; it is implemented with AWS Lambda.
- Resharding cannot be done in parallel, and scaling Kinesis is not instantaneous; it takes time.

Handling duplicates:
- Producers can create duplicates due to network timeouts (a PutRecord is retried): the two records have identical data but unique sequence numbers. Fix: embed a unique record ID in the data and de-duplicate on the consumer side.
- Consumer retries can make your application read the same data twice when record processors restart:
  - A worker terminates unexpectedly
  - Worker instances are added or removed
  - Shards are merged or split
  - The application is deployed
  Fix: make the consumer application idempotent; if the final destination can handle duplicates, it is recommended to handle them there.

Kinesis security:
- Control access / authorization using IAM policies
- Encryption in flight using HTTPS endpoints
- Encryption at rest using KMS
- Client-side encryption
- VPC endpoints
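A minimal sketch of the plain-SDK consumer path referenced in the consumers list above (GetShardIterator + GetRecords); the stream name is hypothetical, and a real application would iterate over every shard and checkpoint its progress (which is what the KCL does for you):

```python
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream_name = "my-data-stream"

# Read from the first shard only, starting at the oldest available record.
shard_id = kinesis.describe_stream(StreamName=stream_name)[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream_name,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in resp["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = resp.get("NextShardIterator")
    time.sleep(0.2)  # stay under the 5 GetRecords calls per shard per second limit
```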
AWS Kinesis Data Firehose (KDF): loads data into AWS data stores; used to deliver real-time streaming data to destinations (S3, Redshift, Elasticsearch, or Splunk).
- Fully managed service that does not require any administration; near real time (60 seconds minimum latency for non-full batches). Scales automatically.
- Data conversion from JSON to Parquet/ORC for S3, and data transformation through AWS Lambda.
- You pay only for the amount of data going through Firehose.
- Spark and the KCL DO NOT READ FROM KDF.
- Firehose buffer sizing: Firehose accumulates records in a buffer, and the buffer is flushed based on time and size rules (you must define a buffer size and a buffer time). With high throughput the buffer size limit is hit first; with low throughput the buffer time limit is hit first (minimum 1 minute / 60 seconds).

CloudWatch Logs: real-time streaming logs; they can be streamed to three destinations (KDS, KDF, Lambda) using CloudWatch Logs subscription filters, configured with the AWS CLI.

SQS: a queue; producers send messages to the queue and consumers poll messages from it. Default message retention is 4 days, with a maximum of 14 days.

AWS IoT: a managed platform that enables you to connect IoT devices to various AWS services and to other devices in a secure manner. It is not a single service but rather a complete ecosystem to build, monitor, and analyze an IoT platform. The need: connect the devices and collect, store, and analyze the data being generated.
The three biggest challenges in the IoT world:
- Connectivity to devices, data collection, and performing actions on the devices
- Controlling, managing, and securing large fleets of devices
- Extracting insights from the data being generated by the devices
Device software: Amazon FreeRTOS (an IoT operating system for microcontrollers) and AWS IoT Greengrass, which extends AWS to edge devices so that they can act locally on the data they generate while still using the cloud for management, analytics, and storage.
Control services:
- AWS IoT Core, which is responsible for securing device connectivity and messaging
- AWS IoT Device Management, to onboard fleets of devices and provide management and software updates to the fleet
- AWS IoT Device Defender, which is responsible for fleet audit and protection
- AWS IoT Things Graph, which is responsible for connecting devices and web services
IoT Core:
- Device Gateway (may include a message broker): helps devices communicate with the AWS cloud
- Rules Engine: contains a set of rules that you define, allowing you to modify the behavior of your devices; it can also define actions to send data to many different targets within AWS
- Device Shadow: a shadow (last known state) of the device, used in case of internet cuts
IoT Device Gateway: the entry point for IoT devices connecting to AWS; it allows devices to securely and efficiently communicate with AWS IoT. Supports the MQTT, WebSockets, and HTTP 1.1 protocols; it is fully managed and scales automatically to support over a billion devices.
IoT Message Broker: pub/sub (publishers/subscribers) messaging pattern, so devices can communicate with one another. Supports MQTT, WebSockets, and HTTP 1.1; it sends messages to all clients subscribed to a topic. (See the publish sketch after this IoT section.)
IoT Thing Registry: all connected IoT devices are represented in the AWS IoT registry, which organizes the resources associated with each device in the AWS Cloud. Each device gets a unique ID, and metadata is supported for each device.
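A minimal sketch of publishing a message to the IoT message broker from the application side, using boto3's IoT data plane; the topic name and payload are hypothetical:

```python
import json
import boto3

# The "iot-data" client talks to the AWS IoT message broker (data plane).
iot = boto3.client("iot-data", region_name="us-east-1")

iot.publish(
    topic="sensors/room1/temperature",  # hypothetical MQTT topic
    qos=1,                              # at-least-once delivery
    payload=json.dumps({"device_id": "sensor-42", "temp_c": 21.7}),
)
```

A rule in the IoT Rules Engine could then route messages published on this topic to targets such as Kinesis, S3, or DynamoDB.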
DMS (Database Migration Service): quickly and securely migrate databases from on-premises to AWS; it is resilient and self-healing. Used for homogeneous and heterogeneous migrations, with continuous data replication using CDC. Usually you also need SCT (the Schema Conversion Tool), but it is not needed when migrating between the same database engine.

Direct Connect (DX): provides a dedicated private connection from a remote network to your VPC; you need to set up a virtual private gateway on the VPC. Use a Direct Connect gateway to connect to multiple VPCs.

AWS Snow Family: used to transfer data that would take more than a week to transfer over the network (offline devices to perform data migrations). Data is transferred to AWS through a physical route, not a network route. Use OpsHub to manage Snow Family devices.

Managed Streaming for Apache Kafka (MSK): fully managed Apache Kafka on AWS.
- You can create custom configurations for your clusters; the default message size is 1 MB, but it is possible to send larger messages (e.g. 10 MB) into Kafka after a custom configuration. Data is stored on EBS volumes.
- Producers (your code) write data to a Kafka topic and consumers poll from the topic.
- Security in Kafka: encryption in flight using TLS between the brokers and between clients and brokers; encryption at rest for your EBS volumes using KMS; network security by authorizing specific security groups for your Kafka clients; authentication & authorization to define who can write/read to which topic, including IAM access control.
- Monitoring MSK (vs Kinesis): CloudWatch metrics, Prometheus, broker log delivery.

Data Storage

AWS S3: a bucket must have a globally unique name and is defined at the region level. Objects (files) have a key (the full path); the maximum object size is 5 TB, and for uploads larger than 5 GB you must use multi-part upload. You can use tags (useful for security and lifecycle) and versioning. All operations are strongly consistent.
- S3 replication: CRR (cross-region replication) / SRR (same-region replication). You must enable versioning on the source and destination buckets; copying is asynchronous; you must give S3 the proper IAM permissions; after activating replication, only new objects are replicated (it is not retroactive).
- Encryption:
  - SSE-S3: server-side encryption (SSE) using keys handled & managed by Amazon S3; you must set a header.
  - SSE-KMS: encryption using keys handled & managed by KMS.
  - SSE-C: encryption using data keys fully managed by the customer outside of AWS (S3 does not store the encryption key you provide).
  - Client-side encryption.
- S3 Select & Glacier Select: an API to retrieve less data using SQL by performing server-side filtering. Can filter by rows & columns (simple SQL statements); less network transfer and less client-side CPU cost. Glacier Select can only operate on uncompressed CSV files. (See the sketch after these S3 notes.)
- S3 event notifications: you can create event notification rules targeting SNS, SQS, or a Lambda function.
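A minimal S3 Select sketch for the server-side filtering described above; the bucket, key, and columns are hypothetical, and the object is assumed to be a gzipped CSV with a header row:

```python
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-data-lake-bucket",       # hypothetical bucket
    Key="logs/2023/01/events.csv.gz",   # hypothetical object
    ExpressionType="SQL",
    Expression="SELECT s.user_id, s.status FROM s3object s WHERE s.status = 'ERROR'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "GZIP"},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream: only the filtered rows cross the network.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode(), end="")
```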
DynamoDB: a good place to store data that needs to be written very often. It is a fully managed, highly available database with replication across 3 AZs. It is a NoSQL database (not a relational database) that scales to massive workloads as a distributed database: millions of requests per second, trillions of rows, hundreds of TB of storage. Fast and consistent performance (low latency on retrieval), integrated with IAM for security, authorization, and administration, low cost, and with autoscaling capabilities.
- Basics: DynamoDB is made of tables; each table has a primary key (which must be decided at creation time); each table can have an infinite number of items (rows); each item has attributes (which can be added over time and can be null); the maximum size of an item is 400 KB.
- Primary keys: either a partition key alone, or a partition key plus a sort key (composite key).
- Provisioned throughput: you provision read (RCU, read capacity units) and write (WCU, write capacity units) capacity. (See the sketch at the end of these DynamoDB notes.)
  - WCU: one write capacity unit represents one write per second for an item up to 1 KB in size. If items are larger than 1 KB, more WCUs are consumed. Examples: for 10 objects per second of 2 KB each we need 2 * 10 = 20 WCU; for 6 objects per second of 4.5 KB each we need 6 * 5 = 30 WCU (4.5 gets rounded up to the next KB); for 120 objects per minute of 2 KB each we need (120 / 60) * 2 = 4 WCU.
  - RCU: there are two read types, Eventually Consistent Read (if we read just after a write, it is possible we get a stale response because of replication) and Strongly Consistent Read (if we read just after a write, we get the correct data). By default DynamoDB is a distributed database and uses eventually consistent reads. One read capacity unit represents one strongly consistent read per second, or two eventually consistent reads per second, for an item up to 4 KB in size.
  - Throttling: if we exceed our RCU or WCU we get ProvisionedThroughputExceededException, typically in the case of hot keys/partitions (one partition is being read too many times). Solutions: 1) use exponential backoff (built into the SDK); 2) distribute partition keys as much as possible; 3) use DynamoDB Accelerator (DAX).
  - Partitions (internal): we start with one partition; each partition supports a maximum of 3,000 RCU / 1,000 WCU and a maximum of 10 GB of data. WCU and RCU are spread evenly across partitions.
- DynamoDB APIs - writing data: PutItem, UpdateItem, conditional writes.
- Indexes: LSI and GSI.
  - LSI (local secondary index): must be defined at table creation time; it is an alternate range key for the table, local to the hash key; up to five local secondary indexes per table.
  - GSI (global secondary index): used to speed up queries on non-key attributes; a GSI = partition key + optional sort key.
- DAX (DynamoDB Accelerator): a cache for DynamoDB; it does not scale automatically, you have to provision it in advance.
- DynamoDB Streams: all changes in DynamoDB (create, update, delete) can end up in a DynamoDB Stream; the stream can be read by AWS Lambda and can be used to implement cross-region replication. The stream has 24 hours of data retention, with a configurable batch size (up to 1,000 rows / 6 MB). You can also use the Kinesis Adapter (with the KCL library) to consume from DynamoDB Streams.
- DynamoDB TTL (time to live): used when you want to expire data based on a date, automatically deleting an item after an expiry date/time. It is a background task operated by the DynamoDB service itself and helps reduce storage and manage table size over time.
- Storing large objects (maximum item size in DynamoDB = 400 KB): store them in S3 and reference them in DynamoDB. Also, for objects smaller than 400 KB that are not accessed very often, you can lower cost by storing them in S3 with a reference in DynamoDB.
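A minimal sketch of writing and reading an item, illustrating the strongly consistent read option mentioned above; the table name and attributes are hypothetical (the table is assumed to have a partition key named user_id):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Users")  # hypothetical table, partition key "user_id"

# PutItem: this small item is well under 1 KB, so it consumes 1 WCU.
table.put_item(Item={"user_id": "u-123", "name": "Alice", "plan": "pro"})

# Strongly consistent read: always reflects the latest successful write,
# but costs a full RCU instead of half an RCU for an eventually consistent read.
resp = table.get_item(Key={"user_id": "u-123"}, ConsistentRead=True)
print(resp.get("Item"))
```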
AWS ElastiCache (real-time analytics service): managed Redis or Memcached (in-memory databases with high performance and low latency) that helps reduce the load on databases.
- Redis: in-memory key-value store with very low latency
- Memcached: in-memory object store; the cache does not survive reboots

Process

Lambda: runs code snippets in the cloud and scales automatically. It is often used to process data as it moves from one service to another: real-time file/stream processing, ETL, processing AWS events, and cron replacement (using time as a trigger). It supports any language you want to use. Lambda integrates with many services and AWS APIs, and you can set up IAM roles so the function can access those services. For example, between S3 and ElastiCache we can set up a trigger using Lambda: once S3 receives data, Lambda can process and analyze this data and send it to ElastiCache. Lambda is not a good fit for building a dynamic website (use EC2 or CloudFront instead). Although you think of a trigger as "pushing" events, Lambda actually polls your Kinesis streams for new activity.

Glue: a serverless system that handles the discovery and definition of table definitions & schemas, used as the central metadata repository for your data lake. The purpose of Glue is to extract structure from your unstructured data by inferring a schema, so that the data can be queried with SQL or any SQL-based tool. It also runs custom ETL jobs (processing), using Apache Spark under the hood without you having to manage the Spark infrastructure.
- Glue Crawler / Data Catalog: scans data in S3 and creates a schema; it stores only the table definition, and the original data stays in S3. This lets you query unstructured data as if it were structured - Glue is the glue between unstructured data and structured data analytics tools. It will also extract partitions of your data based on how your S3 data is organized.
- Glue + Hive: Glue can integrate with Hive; we can use the Glue Data Catalog as the metadata store for Hive, and we can import a Hive metastore into Glue and use it.
- Glue ETL: can automatically generate code in Scala or Python; you can modify the generated code or provide your own Spark or PySpark script. It supports encryption, can be event-driven, and you can provision additional "DPUs" (data processing units) to increase the performance of the underlying Spark jobs (use job metrics to know how many DPUs you need). Glue ETL lets you automatically process and transform your data through a graphical interface; the target can be S3, JDBC (RDS, Redshift), or the Glue Data Catalog. It is fully managed and cost effective, and the jobs run on a serverless Spark platform. In Glue ETL, a DynamicFrame is a collection of DynamicRecords, each with a schema (fields, columns); it is similar to a Spark DataFrame and has Scala and Python APIs.
- Glue Scheduler to schedule the jobs, and Glue Triggers to automate job runs based on "events".
- AWS Glue Dev Endpoints: allow you to develop ETL code using a notebook; they live in a VPC controlled by security groups.
- Running Glue jobs: on a time-based schedule, on demand, or from triggers; job bookmarks keep track of data that has already been processed so reruns do not reprocess it.
- AWS Glue DataBrew: a visual data preparation tool (clean and normalize data faster) used to process large data sets (complicated workflows). It allows you to create recipes of transformations that can be saved as jobs within a larger project. From a security perspective it integrates with KMS, IAM roles, SSL in transit, CloudWatch, and CloudTrail.
- AWS Glue Elastic Views: builds materialized views over Aurora, RDS, and DynamoDB using SQL, handling any copying, combining, or replicating of the data needed.

AWS Lake Formation: built on top of Glue; a system that makes it easy to set up a secure data lake in days, including loading data & monitoring data flows. There is no cost for Lake Formation itself, but the underlying services incur charges (S3, Glue, EMR, Athena, Redshift).

EMR (Elastic MapReduce): a managed Hadoop framework that runs on EC2 instances; it includes Spark, HBase, Presto, Flink, Hive, and more. EMR Notebooks let you create and run code against the cluster. An EMR cluster is a collection of EC2 instances running Hadoop; each instance is called a node and has a role (node type):
- Master node: manages the cluster
- Core node: runs tasks and stores data in HDFS
- Task node: runs tasks only and does not store data (optional; easy to add or remove)
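A minimal sketch of launching a small EMR cluster with the three node roles above via boto3; the cluster name, instance types, and counts are hypothetical, and the default EMR service/instance roles are assumed to already exist in the account:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="analytics-cluster",                       # hypothetical cluster name
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",     # HDFS + tasks
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            {"Name": "Task", "InstanceRole": "TASK",     # tasks only, no HDFS
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",   # assumed default instance profile
    ServiceRole="EMR_DefaultRole",       # assumed default service role
)
print("Cluster id:", response["JobFlowId"])
```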