Tips that help you leverage your Airflow pipelines

About me - Aliaksandr Sheliutsin
- 6+ years in Data Engineering
- Airflow experience from version 1.7 to 2.7

Prerequisite: Airflow

What is Airflow?
Apache Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows.
- Batch-oriented
- Schedule-oriented
- Workflows as code

Prerequisite: DAG, DagRun

What is a DAG?
A DAG (Directed Acyclic Graph) is the core concept of Airflow: it collects Tasks together, organized with dependencies and relationships that say how they should run.

What is a DagRun?
A DagRun is an instance of a DAG with specific time parameters added during execution.

Prerequisite: Operator, Task, TaskInstance

What is an Operator?
An Operator defines a unit of work for Airflow to complete. Examples of operators:
- BashOperator
- PythonOperator
- PostgresOperator

What is a Task?
A Task is an instance of an Operator with specific parameters added by a developer.

What is a TaskInstance?
A TaskInstance is an instance of a Task with specific time parameters added during execution.

Prerequisite: How to add a DAG to Airflow
Here are the steps:
1. Create a Python file with a DAG definition.
2. Add this file to the DAGs folder ($AIRFLOW_HOME/dags).
You can create typical DAGs based on a CSV file
CASE:
// Catchup pipelines for historical data download
// Triggers for support DAGs
PROS:
// Easy to add a new entity
// Less code
// Flexible
// Isolation / you can control the start date per DAG
// Also covered with data validation tests
CONS:
// Requires an advanced understanding of Airflow
// The pipelines end up kind of similar

You can create typical DAGs based on a sharable config (like Google Sheets)
CASE:
// Catchup pipelines for historical GA pipelines
PROS:
// Other people can create Airflow DAGs
// Changes are available without deployments
// Performance
CONS:
// You need to handle the user experience:
//   Documentation
//   Input data validation
//   Error notifications

You can autogenerate DAGs based on Airflow entities (Connections, Variables)
CASE:
// Catchup pipelines for different databases
PROS:
// No need to handle external API calls
CONS:
// Configurations can be lost
// Higher load on the metadata DB

Make your Airflow pipeline event-driven
CASE:
// Domain security tests execution via Pub/Sub + Cloud Function
PROS:
// Results of execution do not rely on a schedule
// You don't end up with tons of one-minute DagRuns
CONS:
// This approach will not work with a high volume of events

Have questions?
// LinkedIn: https://www.linkedin.com/in/aliaksandr-sheliutsin/
// Telegram: @asheliutsin