Apache Airflow is a platform to programmatically author, schedule, and monitor workflows. It supports integration with third-party platforms so that you, our developer and user community, can adapt it to your needs and stack.
When workflows are defined as code, they become more maintainable, testable, and collaborative.
A clean web UI (built with Flask) makes it easy to visualize pipelines running in production.
Principles of Airflow:
Dynamic: Airflow pipelines are defined in Python code, which allows pipelines to be generated (instantiated) dynamically.
Extensible: Easily define your own operators, executors, and so on.
Elegant: Airflow pipelines are lean and explicit; parameterization is built in via the Jinja templating engine.
Scalable: Airflow can orchestrate an arbitrary number of workers.
Airflow Architecture:
The DAG specifies the dependencies between Tasks and the order in which they are to be executed and run; the Tasks themselves describe what to do, be it loading data, running an analysis, starting other systems, or more.
An Airflow installation generally consists of the following components:
1. A scheduler that handles both running scheduled workflows and sending tasks to an executor to run.
2. An executor that takes care of running tasks. In a default installation of Airflow, everything runs in the scheduler, but most production-ready executors actually push task execution to workers.
3. A web server that provides a convenient user interface for inspecting, running, and debugging the behavior of DAGs and jobs.
4. A folder of DAG files read by the scheduler and executor (and any workers the executor has).
5. A metadata database used by the scheduler, executor, and web server to store state.
Airflow DAGs:
A DAG (Directed Acyclic Graph) is the core concept of Airflow, collecting Tasks together, organized with dependencies and relationships to say how they should run.
Declaring a DAG
A DAG can be declared in three ways (these examples assume the imports: import pendulum; from airflow import DAG; from airflow.decorators import dag; from airflow.operators.empty import EmptyOperator):
• Using a context manager:
with DAG("my_dag_name", start_date=pendulum.datetime(2021, 1, 1, tz="UTC"), schedule="@daily", catchup=False) as dag:
    op = EmptyOperator(task_id="task")
• Using a standard constructor and passing the DAG into each operator:
my_dag = DAG("my_dag_name", start_date=pendulum.datetime(2021, 1, 1, tz="UTC"), schedule="@daily", catchup=False)
op = EmptyOperator(task_id="task", dag=my_dag)
• Using the @dag decorator to turn a function into a DAG generator:
@dag(start_date=pendulum.datetime(2021, 1, 1, tz="UTC"), schedule="@daily", catchup=False)
def generate_dag():
    op = EmptyOperator(task_id="task")
dag = generate_dag()
Task Dependencies
A Task/Operator does not usually live alone; it has dependencies on other tasks (those upstream of it), and other tasks depend on it (those downstream of it). Declaring these dependencies between tasks is what makes up the DAG structure (the edges of the directed acyclic graph).
There are two main ways to declare individual task dependencies. The recommended one is to use the >> and << bitshift operators (a combined sketch follows this list):
1. first_task >> [second_task, third_task]
   third_task << fourth_task
2. Or, equivalently, the set_downstream and set_upstream methods:
   first_task.set_downstream([second_task, third_task])
   third_task.set_upstream(fourth_task)
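Putting both styles in context, a minimal sketch using EmptyOperator placeholder tasks (the dag_id is illustrative):

import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    "dependencies_example",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
) as dag:
    first_task = EmptyOperator(task_id="first_task")
    second_task = EmptyOperator(task_id="second_task")
    third_task = EmptyOperator(task_id="third_task")
    fourth_task = EmptyOperator(task_id="fourth_task")

    # first_task must finish before second_task and third_task start;
    # fourth_task must finish before third_task starts.
    first_task >> [second_task, third_task]
    third_task << fourth_task

(EmptyOperator and the schedule argument follow the declaration examples above and need a newer Airflow 2.x release, roughly 2.3+ and 2.4+ respectively; on the 2.2.5 release pinned later in this document, use DummyOperator from airflow.operators.dummy and the schedule_interval argument instead.)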
Scheduling DAGs
DAGs can be scheduled to run in one of two ways:
• When they are triggered either manually or via the API
• On a defined schedule, which is defined as part of the DAG
1. with DAG("my_daily_dag", schedule="@daily"):
2. with DAG("my_daily_dag", schedule="0 0 * * *"):
Both declarations are equivalent: "@daily" is a preset for the cron expression "0 0 * * *" (midnight every day).
Airflow operators
There are three main types of operators:
• Action operators, which perform an action or tell another system to perform an action.
• Transfer operators, which move data from one system to another.
• Sensors, a special type of operator that keeps running until a certain criterion is met.
Operators are Python classes that encapsulate logic to do a unit of work. They can be viewed as a wrapper around each unit of work that defines the actions that will be completed and abstract the majority of code you would typically need to write. When you create an instance of an operator in a DAG and provide it with its required parameters, it becomes a task.
All operators inherit from the abstract BaseOperator class, which contains the logic to execute the work of the operator within the context of a DAG.
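For example, a custom operator only needs to subclass BaseOperator and implement execute(); a minimal sketch (the HelloOperator name and behaviour are illustrative, not part of Airflow itself):

from airflow.models.baseoperator import BaseOperator

class HelloOperator(BaseOperator):
    """Toy operator that greets a name when its task instance runs."""

    def __init__(self, name: str, **kwargs) -> None:
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        # Called by the worker when the task instance runs; the return
        # value is pushed to XCom for downstream tasks to read.
        message = f"Hello {self.name}!"
        print(message)
        return message

Instantiated in a DAG, e.g. HelloOperator(task_id="say_hello", name="World"), it becomes a task like any built-in operator.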
The following are some of the most frequently used Airflow operators:
• PythonOperator: Executes a Python function.
• BashOperator: Executes a bash script.
• KubernetesPodOperator: Executes a task defined as a Docker image in a Kubernetes Pod.
• SnowflakeOperator: Executes a query against a Snowflake database.
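For instance, a BashOperator and a PythonOperator can sit side by side in the same DAG; a minimal sketch (the dag_id, task ids, and callable are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def greet():
    # The unit of work wrapped by the PythonOperator
    print("Hello from a PythonOperator task")

with DAG(
    dag_id="operator_examples",
    start_date=datetime(2022, 2, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    say_hello = PythonOperator(task_id="say_hello", python_callable=greet)
    show_date = BashOperator(task_id="show_date", bash_command="date")

    say_hello >> show_date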
Executor:
Executors are the mechanism by which task instances get run. They have a common API and are “pluggable”, meaning you can swap executors based on your installation needs.
Airflow can only have one executor configured at a time; this is set by the executor option in the [core] section of the configuration file.
Built-in executors are referred to by name, for example: [core]
executor = KubernetesExecutor
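To confirm which executor is active in a given environment, one option (a minimal sketch, assuming Airflow is installed and importable) is to read the loaded configuration from Python:

from airflow.configuration import conf

# Reads the effective [core] executor value, after airflow.cfg and any
# AIRFLOW__CORE__EXECUTOR environment-variable override have been applied.
print(conf.get("core", "executor"))

On recent Airflow 2.x releases the same value can also be read from the command line with airflow config get-value core executor.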
Installation of Apache Airflow on CentOS 7
Requirements:
The minimum hardware requirements:
• CPU: An 8-core CPU is sufficient for a moderate setup with a higher number of concurrent tasks.
• RAM: 4 GB to 8 GB of RAM is recommended, but the exact amount will depend on the size and complexity of your DAGs and the number of concurrent tasks.
• Disk space: The disk space required will depend on the size of your DAGs, logs, and metadata, but a minimum of 100 GB of disk space is recommended.
The minimum requirements for running an Apache Airflow server on a CentOS system are:
• Python 3.6 or later (Airflow 2.x no longer supports Python 3.5)
• A database system such as PostgreSQL or MySQL to store metadata about DAGs and task instances
• A message broker such as RabbitMQ or Redis to manage task queueing and communication between worker nodes (needed when using the CeleryExecutor)
In addition, the following dependencies are required:
• Flask
• Alembic
• sqlalchemy
• cryptography
• celery
• redis (if using Redis as the message broker)
• psycopg2 (if using PostgreSQL as the database)
• mysqlclient or PyMySQL (if using MySQL as the database; the legacy mysql-python package does not support Python 3)

Installing and Configuring the System for Apache Airflow: Preparing the Environment
Install all needed system dependencies
• yum -y install python3-pip
• yum -y groupinstall "Development tools" (the CentOS equivalent of Debian's build-essential)
• sudo yum -y install gcc gcc-c++
• yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel python-devel wget cyrus-sasl-devel.x86_64 unzip
• yum -y install python3-devel
• yum -y install mysql-community-devel.x86_64
Install all needed Python dependencies
• pip3 install --upgrade --ignore-installed pip setuptools
• pip3 install neuralpy
• pip3 install --upgrade pip
• pip3 install -r requirements.txt
• pip3 install wheel
• pip3 install virtualenv
• pip install airflow-plugins
alembic==1.7.7
amqp==5.1.1
anyio==3.6.1
apache-airflow==2.2.5
apache-airflow-providers-ftp==2.1.0
apache-airflow-providers-http==2.1.0
apache-airflow-providers-imap==2.2.1
apache-airflow-providers-sqlite==2.1.1
apache-airflow-providers-ssh==2.4.1
apispec==3.3.2
argcomplete==2.0.0
async-generator==1.10
attrs==20.3.0
Babel==2.11.0
bcrypt==4.0.1
billiard==3.6.4.0
blinker==1.5
cached-property==1.5.2
cachelib==0.6.0
cattrs==1.0.0
celery==5.1.2
certifi==2022.6.15
cffi==1.15.1
charset-normalizer==2.0.12
click==7.1.2
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.2.0
clickclick==20.10.2
colorama==0.4.5
colorlog==6.7.0
commonmark==0.9.1
connexion==2.14.2
contextlib2==21.6.0
contextvars==2.4
croniter==1.3.8
cryptography==39.0.0
cycler==0.11.0
dataclasses==0.8
defusedxml==0.7.1
Deprecated==1.2.13
dill==0.3.4
dnspython==2.2.1
docutils==0.16
email-validator==1.3.1
et-xmlfile==1.1.0
ez-setup==0.9
fastapi==0.79.0
Flask==1.1.4
Flask-AppBuilder==3.4.5
Flask-Babel==2.0.0
Flask-Caching==1.10.1
Flask-JWT-Extended==3.25.1
Flask-Login==0.4.1
Flask-OpenID==1.3.0
Flask-Session==0.4.0
Flask-SQLAlchemy==2.5.1
Flask-WTF==0.14.3
graphviz==0.19.1
greenlet==1.0.0
gunicorn==20.1.0
h11==0.12.0
httpcore==0.14.7
httpx==0.22.0
idna==3.3
immutables==0.18
importlib-metadata==4.8.3
importlib-resources==5.4.0
inflection==0.5.1
iso8601==1.1.0
itsdangerous==1.1.0
Jinja2==2.11.3
jsonschema==3.2.0
kiwisolver==1.3.1
kombu==5.1.0
lazy-object-proxy==1.7.1
lockfile==0.12.2
mailerpy==0.1.0
Mako==1.1.6
Markdown==3.3.7
MarkupSafe==2.0.1
marshmallow==3.14.1
marshmallow-enum==1.5.1
marshmallow-oneofschema==3.0.1
marshmallow-sqlalchemy==0.26.1
matplotlib==3.3.4
mysql-connector==2.2.9
nepali-datetime==1.0.7
numpy==1.19.5
openpyxl==3.0.7
packaging==21.3
pandas==1.1.5
paramiko==3.0.0
pendulum==2.1.2
pep562==1.1
Pillow==8.4.0
prison==0.2.1
prompt-toolkit==3.0.36
psutil==5.9.4
pycparser==2.21
pydantic==1.9.2
Pygments==2.14.0
PyJWT==1.7.1
pyminizip==0.2.4
PyMySQL==1.0.2
PyNaCl==1.5.0
pyparsing==3.0.9
pyrsistent==0.18.0
pysftp==0.2.9
python-daemon==2.3.2
python-dateutil==2.8.1
python-nvd3==0.15.0
python-slugify==4.0.1
python3-openid==3.2.0
pytz==2021.1
pytzdata==2020.1
PyYAML==6.0
requests==2.27.1
rfc3986==1.5.0
rich==12.6.0
setproctitle==1.2.3
six==1.15.0
sniffio==1.2.0
SQLAlchemy==1.3.24
SQLAlchemy-JSONField==1.0.0
SQLAlchemy-Utils==0.39.0
sshtunnel==0.4.0
starlette==0.19.1
swagger-ui-bundle==0.0.9
tabulate==0.8.10
tenacity==8.1.0
termcolor==1.1.0
text-unidecode==1.3
typing==3.7.4.3
typing_extensions==4.1.1
unicodecsv==0.14.1
urllib3==1.26.11
vine==5.0.0
wcwidth==0.2.6
Werkzeug==1.0.1
wrapt==1.14.1
WTForms==2.3.3
xlrd==2.0.1
zipp==3.4.1
***Note: If the GPG key causes an issue, you can import the most recent key during installation: rpm --import https://repo.mysql.com/RPM-GPG-KEY-mysql-2022
This note relates to installing the MySQL database management system with the RPM package manager. The issue referred to involves a security feature called GPG keys: a GPG key is used to sign packages and verify the authenticity of package sources. If you encounter a GPG key error when installing MySQL, importing the most recent key from the URL above resolves the issue.
Configuring Airflow:
• export AIRFLOW_HOME=~/airflow
export AIRFLOW_HOME=~/airflow sets the Airflow home directory. This environment variable tells Apache Airflow where to look for its configuration file, DAGs, logs, and other metadata. If AIRFLOW_HOME is not set, Airflow defaults to ~/airflow. By setting AIRFLOW_HOME to ~/airflow explicitly, we are specifying that the Airflow home directory is the airflow subdirectory of the current user's home directory.
Configuring Database:
CREATE DATABASE airflow CHARACTER SET utf8 COLLATE utf8_unicode_ci;
CREATE USER 'airflow'@'%' IDENTIFIED BY 'Airflow@123';
GRANT ALL ON airflow.* TO 'airflow'@'%';
Apache Airflow uses a database to store its metadata, such as information about DAGs, task instances, and user information. The database is used by the Airflow scheduler to determine what tasks to run, when to run them, and their dependencies. The database is also used to store the status of tasks and their execution history.


By creating a dedicated database and user for Apache Airflow, you can ensure that the metadata is stored in a secure and isolated location that is separate from other databases or users in your system. This can help to improve security, data management, and performance.
Additionally, having a dedicated database and user for Apache Airflow enables you to easily manage the metadata, including backing up, restoring, or moving the metadata as needed. This can be especially useful in large, complex Airflow deployments where the metadata can become quite large and complex over time.
Initialization of Airflow
airflow (running the command once generates the default .cfg file)
Apache Airflow uses a configuration file to set various settings and options for the Airflow environment. The configuration file is usually named airflow.cfg and is stored in the $AIRFLOW_HOME directory.
To create an airflow.cfg file, simply run the airflow command; if $AIRFLOW_HOME/airflow.cfg does not exist yet, Airflow generates one with default values.
Make the following changes to the {AIRFLOW_HOME}/airflow.cfg file:
sql_alchemy_conn = mysql+pymysql://airflow:Airflow%40123@localhost/airflow
(the "@" inside the password Airflow@123 is URL-encoded as %40 so that the connection URI parses correctly)
executor = LocalExecutor
(the executor can also be overridden with an environment variable, e.g. AIRFLOW__CORE__EXECUTOR=CeleryExecutor for a distributed setup; environment variables take precedence over airflow.cfg)
default_timezone = Asia/Kathmandu
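Before initializing the database further below, the connection string can be sanity-checked with a short script (a minimal sketch, assuming the SQLAlchemy and PyMySQL packages from requirements.txt are installed and MySQL is running locally):

from sqlalchemy import create_engine

# Same URI as sql_alchemy_conn; note the "@" in the password is encoded as %40.
engine = create_engine("mysql+pymysql://airflow:Airflow%40123@localhost/airflow")

with engine.connect() as connection:
    # A trivial query proves that the credentials, host, and database are valid.
    print(connection.execute("SELECT VERSION()").scalar())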
***Note: If the MySQL version is less than 8, set explicit_defaults_for_timestamp=1 in my.cnf (or at runtime: SET GLOBAL explicit_defaults_for_timestamp = 1;).
The line explicit_defaults_for_timestamp = 1 in the my.cnf file sets the global system variable explicit_defaults_for_timestamp to 1. This option controls the behavior of the TIMESTAMP data type in MySQL.
In MySQL versions before 8.0, the default behavior of the TIMESTAMP data type is to automatically set the current timestamp whenever a new row is inserted into the table and the value for the TIMESTAMP column is not specified.
Starting with MySQL 8.0, explicit_defaults_for_timestamp is enabled by default, so TIMESTAMP columns follow standard SQL defaulting rules instead. The option can still be set explicitly to control this behavior and keep backwards compatibility.
The statement SET GLOBAL explicit_defaults_for_timestamp = 1; changes the global system variable at runtime. The change lasts only until the MySQL server is restarted and is not saved across restarts, which is why the my.cnf setting is also needed.
These options are relevant for Apache Airflow database initialization because Airflow uses the TIMESTAMP data type to store timestamps in its metadata database. By setting the explicit_defaults_for_timestamp option, you can ensure that the timestamps are stored in the metadata database as expected and can avoid any compatibility issues that might occur with newer versions of MySQL.
Initialization of Airflow Database:
• airflow db init
User Creation for Airflow:
• airflow users create \
    --username admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email @gmail.com
(if --password is not supplied, the command prompts for one interactively)
** Other built-in roles are also available (Admin, User, Op, Viewer, Public).
Creating an Airflow user with a specific role:
airflow users create \
    --username <username> \
    --firstname DBA \
    --lastname Team \
    --role User \
    --email @gmail.com
Deleting an Airflow user: airflow users delete -u dba_users
Other necessary changes (for SMTP and other services) can be made using: Link

Running Airflow as a systemd Service:
Note:
The systemd files in this directory are tested on RedHat-based systems. Copy the unit files (airflow-flower.service, airflow-kerberos.service, airflow-scheduler.service, airflow-webserver.service, airflow-worker.service) to /usr/lib/systemd/system, and copy the airflow.conf file to /etc/tmpfiles.d/ or /usr/lib/tmpfiles.d/. Copying airflow.conf ensures that /run/airflow is created with the correct owner and permissions (0755 airflow airflow).

You can then start the various servers using systemctl start <service>. Enabling services can be done by issuing systemctl enable <service>.
• By default, the environment configuration points to /etc/sysconfig/airflow (i.e. export AIRFLOW_HOME=~/airflow). You can copy the "airflow" environment file to this directory and edit it to your liking.
With some minor changes, they probably work on other systemd systems.
Files Required:
airflow.conf
D /run/airflow 0755 root root
airflow-scheduler.service
[Unit]
Description=Airflow scheduler daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service
[Service]
Environment=AIRFLOW_HOME=/root/airflow
User=root
Group=root
Type=simple
ExecStart=/usr/bin/bash -c 'airflow scheduler --pid /root/airflow/airflow-scheduler.pid'
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
airflow-webserver.service
[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service
[Service]
Environment=AIRFLOW_HOME=/root/airflow
User=root
Group=root
Type=simple
ExecStart=/usr/bin/bash -c 'airflow webserver --pid /root/airflow/airflow-webserver.pid'
Restart=on-failure
RestartSec=5s
PrivateTmp=true
[Install]
WantedBy=multi-user.target
Any other services needed (if required) can be installed from the link.

• sudo systemctl enable airflow-webserver
• sudo systemctl enable airflow-scheduler
• sudo systemctl start airflow-webserver
• sudo systemctl start airflow-scheduler
These commands are used to manage the Apache Airflow services on a Linux system that uses Systemd as the init system (such as CentOS or Ubuntu).
1. sudo systemctl enable airflow-webserver: This command enables the Apache Airflow webserver service to start automatically when the system is rebooted.
2. sudo systemctl enable airflow-scheduler: This command enables the Apache Airflow scheduler service to start automatically when the system is rebooted.
3. sudo systemctl start airflow-webserver: This command starts the Apache Airflow webserver service.
4. sudo systemctl start airflow-scheduler: This command starts the Apache Airflow scheduler service.
Once these commands are executed, the Apache Airflow webserver and scheduler services should be running and available for use. You can use the systemctl status airflow-webserver and systemctl status airflow-scheduler commands to check the status of the services.
***NOTE: The webserver listens on 0.0.0.0:8080 by default; this can be changed via the web_server_host and web_server_port options in the [webserver] section of the {AIRFLOW_HOME}/airflow.cfg file.
Airflow Code Editor Plugin
Link
Link3
Link2
(Not feasible these days)
chmod 755 {AIRFLOW_HOME}/dags/{dag_name}
systemctl restart airflow-scheduler
Note: The server runs on 0.0.0.0:8080 by default
This is a sample first Apache Airflow DAG. The DAG consists of three tasks:
1. "Get current datetime" uses the BashOperator to run the "date" command.
2. "Process current datetime" uses the PythonOperator to process the result of the previous task and return a dictionary with datetime information.
3. "SQL Dump" uses the PythonOperator to run the "mysqldump" command and dump all databases.
The DAG is scheduled to run once a day at 1:00 AM, starting from February 1, 2022. The catchup parameter is set to False, meaning it will only run tasks for the current date and not backfill missed runs. The tasks are linked using the ">>" operator.
Finally, the DAG file is made executable by changing its permissions to 755 and the Airflow scheduler is restarted. After this, the Airflow UI will be accessible at 0.0.0.0:8080.
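A minimal sketch of a DAG matching the description above (the dag_id, task ids, helper function names, dump file path, and MySQL credentials are illustrative assumptions, not taken from the original listing):

from datetime import datetime
import subprocess

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def process_datetime(ti):
    # Pull the output of the "date" command pushed to XCom by the previous task
    dt = ti.xcom_pull(task_ids="get_current_datetime")
    if not dt:
        raise ValueError("No datetime value found in XCom")
    parts = dt.split()
    return {"year": parts[-1], "month": parts[1], "day": parts[2], "time": parts[3]}

def sql_dump():
    # Dump all databases with mysqldump (credentials and output path are placeholders)
    subprocess.run(
        "mysqldump -uroot -pRoot@123 --all-databases > /root/airflow/all_databases.sql",
        shell=True,
        check=True,
    )

with DAG(
    dag_id="first_airflow_dag",
    schedule_interval="0 1 * * *",   # once a day at 1:00 AM
    start_date=datetime(2022, 2, 1),
    catchup=False,
) as dag:
    get_current_datetime = BashOperator(
        task_id="get_current_datetime",
        bash_command="date",
    )
    process_current_datetime = PythonOperator(
        task_id="process_current_datetime",
        python_callable=process_datetime,
    )
    dump_all_databases = PythonOperator(
        task_id="sql_dump",
        python_callable=sql_dump,
    )
    get_current_datetime >> process_current_datetime >> dump_all_databases

Saved under {AIRFLOW_HOME}/dags/, this file will appear in the UI once the scheduler parses the DAGs folder.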
Starting a DAG that connects to a different server
apache-airflow-providers-ssh
1. SCHEDULING ON A DIFFERENT SERVER USING THE SSH OPERATOR
(the SSHOperator import below comes from the apache-airflow-providers-ssh package mentioned above; it replaces the deprecated airflow.contrib path)

from datetime import datetime, date

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

today = date.today().strftime("%Y%m%d")
t1_bash = """mv test.sql test{}.sql""".format(today)
t2_bash = """mysqldump -uroot -pRoot@123 --all-databases > test.sql"""

with DAG(
    dag_id='testing_stuff',
    schedule_interval='* * * * *',
    start_date=datetime(year=2022, month=2, day=1),
    catchup=False,
) as dag:
    t1 = SSHOperator(ssh_conn_id='ssh_nishant', task_id='test_ssh_operator_mv', command=t1_bash)
    t2 = SSHOperator(ssh_conn_id='ssh_nishant', task_id='test_ssh_operator_backup', command=t2_bash)

    t1 >> t2

Note:** When the command is a bash script file, add a trailing space, e.g. command = 'bibek.sh ' (the space prevents Airflow from treating the .sh value as a Jinja template file).
This code is for scheduling an Airflow DAG to run commands on a different server using the SSH operator. The DAG will perform the following tasks:
1. Rename the test.sql file to a file named test{today's date}.sql using the "mv" command in the first SSHOperator (t1)
2. Run the mysqldump command to backup all databases in the second SSHOperator (t2)
The ssh_conn_id parameter in both SSHOperator objects refers to an Airflow connection that has been set up with the credentials and settings necessary to SSH into the remote server. The start_date of the DAG is set to 2022-02-01 and the schedule_interval is set to run every minute ('* * * * *'). The DAG is assigned a dag_id of "testing_stuff".
Note that the last line, t1 >> t2, is the operator to set t1 as the upstream task for t2, meaning t2 will run only after t1 is completed.
Index:
Content types used in this document: Bash script, SQL script, Python, and text to be written or added.
References:
1. https://blog.clairvoyantsoft.com/installing-and-configuring-apache-airflow-619a1df3300f
2. https://betterdatascience.com/apache-airflow-write-your-first-dag/
3. https://stackoverflow.com/questions/69777054/cannot-setup-a-mysql-backend-for-airflow-localexecutor
4. https://stackoverflow.com/questions/52948855/how-to-use-airflow-scheduler-with-systemd
5. https://teguharif.medium.com/data-engineering-series-run-apache-airflow-as-a-service-on-centos-7-apache-airflow-2-f9ea16fdaef8
6. https://airflow.apache.org/docs/apache-airflow/stable/security/access-control.html
7. https://docs.astronomer.io/learn/airflow-sql
8. https://docs.aws.amazon.com/mwaa/latest/userguide/t-apache-airflow-11012.html
9. https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/bigquery.html
10. https://stackoverflow.com/questions/60809411/send-output-of-oracleoperator-to-another-task-in-airflow