Nisarg Shah

Computer Science Graduate Student


Getting Started with Airflow for Your Data Workflows

When working with data products, developers often run into data tasks that end up as scripts or cron jobs. Such a script typically performs a series of tasks on the data, and those tasks may depend on each other's execution status and results. In these scenarios, how do you build flows in a structured way? How do you define a task's dependencies and check its logs for errors? Airflow lets you do all of this with more flexibility and ease. For example, pipelines can be scheduled to run daily, and developers can receive an email when a specific task in the pipeline fails.

In this talk, I will first go through the main benefits of using Airflow as an orchestration tool. I will then compare a plain Python script with the equivalent DAG defined in Airflow. I will start from basic installation and move on to understanding DAGs and tasks, exploring operators (PythonOperator and BashOperator), and defining the pipeline structure. After this talk, attendees should be able to develop Airflow pipelines in which each task has a Python or Bash implementation.
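To make the idea of a DAG of dependent tasks concrete, here is a minimal plain-Python sketch. It is not Airflow code and does not use PythonOperator or BashOperator. It only illustrates how tasks and their upstream dependencies determine an execution order, using the standard library's graphlib (Python 3.9+). The task names (extract, transform, load) and the run_pipeline helper are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter


def extract():
    # Hypothetical source data for the illustration.
    return [3, 1, 2]


def transform(rows):
    return sorted(rows)


def load(rows):
    return f"loaded {len(rows)} rows"


# Each task maps to (callable, list of upstream task names).
TASKS = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run_pipeline(tasks):
    """Run tasks in dependency order, passing upstream results downstream."""
    graph = {name: set(upstream) for name, (_, upstream) in tasks.items()}
    # static_order() yields every task after all of its upstream tasks.
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        func, upstream = tasks[name]
        results[name] = func(*(results[u] for u in upstream))
    return order, results


if __name__ == "__main__":
    order, results = run_pipeline(TASKS)
    print(order)
    print(results["load"])
```

In Airflow, the same structure would be declared as a DAG whose tasks are operator instances. The scheduler then handles ordering, retries, and logging, instead of a hand-written loop like the one above.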

At the end, I will also touch on an issue with fetching logs that a developer can face when using docker-swarm in pipelines. The workaround is to use multithreading: one thread reads and prints the logs while the other checks the status of the service.
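The two-thread workaround might look roughly like the sketch below. It does not talk to Docker. FakeService is a hypothetical stand-in for a swarm service, and every name in it (run, state, read_log, run_and_follow) is an assumption made for illustration, not part of any Docker or Airflow API. The pattern shown is the one described above: one thread reads and prints log lines while another polls the service status and signals when it has finished.

```python
import queue
import threading
import time


class FakeService:
    """Hypothetical stand-in for a docker-swarm service (illustration only)."""

    def __init__(self, lines):
        self._lines = lines
        self._logs = queue.Queue()
        self._state = "running"

    def run(self):
        # Emit log lines over time, then mark the service as finished.
        for line in self._lines:
            time.sleep(0.01)
            self._logs.put(line)
        self._state = "complete"

    def state(self):
        return self._state

    def read_log(self, timeout):
        # Return the next log line, or None if nothing arrived in time.
        try:
            return self._logs.get(timeout=timeout)
        except queue.Empty:
            return None


def stream_logs(service, done, collected):
    """Thread 1: read and print logs until the service is done and drained."""
    while True:
        # Sample the flag before reading so no late line is missed.
        finished = done.is_set()
        line = service.read_log(timeout=0.05)
        if line is not None:
            print(line)
            collected.append(line)
        elif finished:
            break


def watch_status(service, done, poll_interval=0.02):
    """Thread 2: poll the service status and signal when it stops running."""
    while service.state() == "running":
        time.sleep(poll_interval)
    done.set()


def run_and_follow(service):
    done = threading.Event()
    collected = []
    threads = [
        threading.Thread(target=service.run),
        threading.Thread(target=stream_logs, args=(service, done, collected)),
        threading.Thread(target=watch_status, args=(service, done)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return service.state(), collected
```

Inside an Airflow task, a function like run_and_follow could be called from a PythonOperator callable so that the service's log lines show up in the task log while the task waits for the service to finish.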