What is Azkaban?

Azkaban is a job scheduler for batch Hadoop workloads.

(Wait, what? 😵 Let us break it down for you.)

Let's begin by understanding what a job is (in this context).

What are jobs and dependencies?

A job is an independent application that reads through the data and pulls out the information it needs.

In a big data ecosystem, several such jobs run simultaneously, each fetching the information it requires. Sometimes, the information produced by one job (let's call this Job A) is also required by another job (let's say Job B).

In such cases, the latter job (a.k.a. Job B) cannot run until the former job (a.k.a. Job A) has finished. This is called a job dependency. (So far so good, yeah?)

Let's look at a real-world scenario. There are two jobs: one that creates insights from data, and a second that emails those insights to the people who need them. Since the email job depends on the output of the insights job, the insights job needs to run first.
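In Azkaban's classic flow format, each job is described by a small `.job` properties file, and dependencies are declared with the `dependencies` property. Here's a minimal sketch of the insights/email scenario above (the file names, commands, and script names are illustrative assumptions, not part of any real deployment):

```properties
# insights.job -- generates the insights (hypothetical script)
type=command
command=bash generate_insights.sh
```

```properties
# email.job -- runs only after insights.job succeeds
type=command
command=bash email_insights.sh
dependencies=insights
```

These `.job` files are packaged into a zip and uploaded to Azkaban via its web UI; Azkaban then runs `email` only once `insights` has finished successfully.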

An example of a job dependency.

That's exactly what Azkaban does for Hadoop.

LinkedIn developed Azkaban primarily to solve its job dependency management problem.

What are the main components of Azkaban?

There are 3 main components in Azkaban:

1. Relational Database: This is a MySQL database that stores the state of all the jobs in Azkaban. This includes information such as:

  1. The job schedule
  2. The jobs getting executed at a given time
  3. Whether a job has succeeded, failed, or is being retried
  4. The dependencies of a given job

2. Azkaban Web Server: This is a web interface to track and maintain jobs. It gives you information about:

  1. The state of the jobs getting executed
  2. The history of a job
  3. The jobs that were executed in the past
  4. Logs for all the jobs for all the times they ran
  5. The overall state of the workflows

3. Azkaban Executor Server: This retrieves jobs that need to be executed (based on schedule and dependencies) from the relational database. After retrieving them, it updates the database with the state of each job. It also sends the jobs' logs back to the database.

Is that all Azkaban does? Are there any additional features that Azkaban provides?

Azkaban also offers utilities for scheduling jobs that need to run at regular intervals. Jobs that fail can be retried automatically.
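Retry behavior can be configured per job in the same `.job` properties file, using properties such as `retries` (number of retry attempts) and `retry.backoff` (the delay between attempts, in milliseconds). A minimal sketch, assuming the same hypothetical insights job from earlier:

```properties
# insights.job -- with retry settings (values are illustrative)
type=command
command=bash generate_insights.sh
retries=3
retry.backoff=60000
```

With this configuration, a failed run would be retried up to three times, waiting a minute between attempts.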

Azkaban also allows you to define SLAs and alerting rules for each job.

