4 tips that help you leverage your Airflow pipelines
About me
- Aliaksandr Sheliutsin
- 6+ years in Data Engineering
- Experience with Airflow versions 1.7 through 2.7
Prerequisite: Airflow
What is Airflow?
Apache Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows.
- Batch-oriented
- Schedule-oriented
- Workflows as code
Prerequisite: DAG, DagRun
What is a DAG?
A DAG (Directed Acyclic Graph) is the core concept of Airflow: it collects Tasks together, organized with dependencies and relationships that say how they should run.
What is a DagRun?
A DagRun is a single instance of a DAG with specific time parameters added during execution.
Prerequisite: Operator, Task, TaskInstance
What is an Operator?
An Operator defines a unit of work for Airflow to complete. Examples of operators:
- BashOperator
- PythonOperator
- PostgresOperator
What is a Task?
A Task is an instance of an Operator with specific parameters added by a developer.
What is a TaskInstance?
A TaskInstance is an instance of a Task with specific time parameters added during execution.
Prerequisite: How to add a DAG to Airflow
Here are the steps:
1. Create a Python file with a DAG definition.
2. Add this file to the DAGs folder ($AIRFLOW_HOME/dags).
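A minimal sketch of such a file, assuming Airflow 2.x (the dag_id, schedule, and command are illustrative, not from the talk):

# $AIRFLOW_HOME/dags/hello_dag.py -- minimal DAG definition
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # named `schedule_interval` on Airflow < 2.4
    catchup=False,
) as dag:
    # Instantiating an Operator with concrete parameters creates a Task;
    # each scheduled run of that Task becomes a TaskInstance.
    hello = BashOperator(task_id="hello", bash_command="echo hello")

Once the scheduler parses the file, the DAG appears in the UI under its dag_id.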
You can create typical DAGs based on a CSV file
CASE:
// Catchup pipelines for historical data downloads
// Triggers for support DAGs
PROS:
Easy to add a new entity
Less code
Flexible
Isolation: you can control the start date per entity
Also covered with data validation tests
CONS:
Requires an advanced understanding of Airflow
Only suits pipelines that are broadly similar
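A minimal sketch of this pattern, assuming a hypothetical entities.csv (columns name,query) sitting next to the DAG file; each row becomes its own catchup DAG:

# Generate one catchup DAG per row of a CSV file (hypothetical entities.csv).
import csv
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.bash import BashOperator

CSV_PATH = Path(__file__).parent / "entities.csv"  # columns: name,query

with CSV_PATH.open() as f:
    for row in csv.DictReader(f):
        with DAG(
            dag_id=f"catchup_{row['name']}",
            start_date=datetime(2023, 1, 1),  # isolated, controllable start date
            schedule="@daily",
            catchup=True,  # backfills history from start_date
        ) as dag:
            BashOperator(task_id="extract", bash_command=f"echo {row['query']}")
        # Register each DAG in the module globals so Airflow discovers it.
        globals()[dag.dag_id] = dag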
You can create typical DAGs based on a sharable config (like Google Sheets)
CASE:
// Catchup pipelines for historical GA (Google Analytics) data
PROS:
Other people can create Airflow DAGs
Changes are available without deployments
CONS:
Performance (the config is fetched via an external API at parse time)
You need to handle the user experience:
// Documentation
// Input data validation
// Error notifications
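A minimal sketch of the idea, assuming the sheet is published to the web as CSV (the URL and columns are placeholders, not from the talk); anyone with edit access to the sheet can add a pipeline, and it appears on the next DAG parse without a deployment:

# Generate DAGs from a shared Google Sheet published as CSV.
import csv
import io
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.bash import BashOperator

# Placeholder "publish to web" CSV export URL of the shared sheet.
SHEET_URL = "https://docs.google.com/spreadsheets/d/<SHEET_ID>/export?format=csv"

try:
    rows = list(csv.DictReader(io.StringIO(requests.get(SHEET_URL, timeout=10).text)))
except requests.RequestException:
    rows = []  # input validation / error notifications would hook in here

for row in rows:
    if not row.get("name"):  # minimal input data validation
        continue
    with DAG(
        dag_id=f"ga_catchup_{row['name']}",
        start_date=datetime(2023, 1, 1),
        schedule="@daily",
        catchup=True,
    ) as dag:
        BashOperator(task_id="load", bash_command=f"echo {row['name']}")
    globals()[dag.dag_id] = dag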
Autogenerate DAGs based on Airflow entities (Connections, Variables)
CASE:
// Catchup pipelines for different databases
PROS:
No need to handle external API calls
CONS:
Configurations can be lost (Connections and Variables are not version-controlled)
Higher load on the metadata DB
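A minimal sketch, assuming a hypothetical naming convention where every Connection whose id starts with db_ should get a catchup pipeline; the DAG file queries Airflow's own metadata DB at parse time, which is where the extra load comes from:

# Generate one catchup DAG per matching Airflow Connection.
from datetime import datetime

from airflow import DAG, settings
from airflow.models import Connection
from airflow.operators.bash import BashOperator

session = settings.Session()
try:
    # Hypothetical convention: every db_* connection gets a pipeline.
    conn_ids = [
        c.conn_id
        for c in session.query(Connection).filter(Connection.conn_id.like("db_%"))
    ]
finally:
    session.close()

for conn_id in conn_ids:
    with DAG(
        dag_id=f"catchup_{conn_id}",
        start_date=datetime(2023, 1, 1),
        schedule="@daily",
        catchup=True,
    ) as dag:
        BashOperator(task_id="dump", bash_command=f"echo dumping {conn_id}")
    globals()[dag.dag_id] = dag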
Make your Airflow pipeline event-driven
CASE:
// Domain security tests executed via Pub/Sub and a Cloud Function
PROS:
Results of execution do not rely on a schedule
No more piles of one-minute DagRuns
CONS:
This approach will not work with a high volume of events
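A minimal sketch of the Cloud Function side, assuming Airflow 2.x with the stable REST API enabled and basic auth (the URL, credentials, and dag_id are placeholders); each Pub/Sub message triggers exactly one DagRun instead of polling on a one-minute schedule:

# Cloud Function (Pub/Sub trigger) that starts a DagRun via Airflow's REST API.
import requests

AIRFLOW_URL = "https://airflow.example.com"  # placeholder
DAG_ID = "domain_security_tests"             # hypothetical dag_id
AUTH = ("api_user", "api_password")          # placeholder credentials

def handle_pubsub_event(event, context):
    """Entry point for the Pub/Sub-triggered Cloud Function."""
    # Stable REST API: POST /api/v1/dags/{dag_id}/dagRuns
    resp = requests.post(
        f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
        auth=AUTH,
        json={"conf": {"message_id": context.event_id}},
    )
    resp.raise_for_status()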
Have questions?
// LinkedIn: https://www.linkedin.com/in/aliaksandr-sheliutsin/
// Telegram: @asheliutsin