What is a data pipeline?

A data pipeline is a chain of processes that extracts data and loads it into a data repository, where the output of one process becomes the input of the next.

Think of it like a manufacturing line for data. 🏭

Data engineers build, maintain and manage a data pipeline to transform raw data into a usable format for the entire organization.

An analogy for explaining a data pipeline. Image courtesy: Angela VandenBroek

What is a data pipeline used for?

Data engineers capture data from multiple sources and in different formats, then store it in a data warehouse, a database or even an app, so that data teams can analyze this information and derive insights for decision-making.

Making raw data useful involves several steps, such as copying the data, cleaning it, and merging it with other data sources. A data pipeline handles all these steps automatically (hence the manufacturing line for data) while ensuring that the data retains its integrity and credibility.
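As a minimal sketch of the "output becomes input" idea, the steps above can be chained as plain Python functions. The function names and sample records here are invented for illustration, not part of any real framework:

```python
# A toy pipeline: each step's output feeds the next step.
# Step names and sample records are illustrative only.

raw_records = ["  Alice,35 ", "Bob,29", "  Alice,35 ", "carol,41"]

def copy_data(records):
    # Step 1: copy the raw data so the source is never mutated.
    return list(records)

def clean_data(records):
    # Step 2: trim whitespace, normalize case, drop duplicates.
    seen, cleaned = set(), []
    for record in records:
        record = record.strip().title()
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned

def merge_data(records, other_source):
    # Step 3: merge with another data source.
    return records + other_source

# The pipeline itself: the output of one process is the input of the next.
result = merge_data(clean_data(copy_data(raw_records)), ["Dave,52"])
print(result)  # ['Alice,35', 'Bob,29', 'Carol,41', 'Dave,52']
```

In a real pipeline each step would be a separate job with its own storage and monitoring, but the chaining principle is the same.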

What is ETL (Extract Transform Load)?

ETL is the process of copying data from its source into a data repository such as a data warehouse.

It stands for Extract, Transform and Load and describes one common data pipeline model.

  1. Data extraction: You extract data from various sources in a wide variety of formats
  2. Data transformation: You clean, process and organize data in a format suitable for querying and analysis
  3. Data loading: You store data in a data repository such as a data warehouse, a data lake or a database
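The three steps above can be sketched end to end with the standard library. This is a hedged toy example, not a production ETL job: the CSV snippet, table name and column names are made up, and SQLite stands in for a real warehouse or database:

```python
import csv
import io
import sqlite3

# 1. Extract: read rows from a source (an in-memory CSV stands in
#    for a real file, API feed, or application database).
source = io.StringIO("name,signups\nacme,10\nglobex, 25\n")
rows = list(csv.DictReader(source))

# 2. Transform: clean and reshape the data into a form suitable
#    for querying and analysis, *before* loading it.
transformed = [(r["name"].strip().upper(), int(r["signups"])) for r in rows]

# 3. Load: store the cleaned rows in the data repository.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (company TEXT, count INTEGER)")
conn.executemany("INSERT INTO signups VALUES (?, ?)", transformed)

total = conn.execute("SELECT SUM(count) FROM signups").fetchone()[0]
print(total)  # 35
```

The defining trait of ETL is that the transform happens in the middle, so only already-cleaned data ever lands in the repository.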

What is ELT (Extract Load Transform)?

ELT stands for Extract, Load and Transform. Like ETL, ELT is also a data pipeline model.

Unlike ETL, where you transform the data before loading it, ELT loads the data first and then transforms it into a format that's clean and ready for analysis.

With ETL, business users depend on engineers to transform the data before it becomes accessible and understandable, and transformation is a tedious, time-consuming process. ELT removes this bottleneck by making the data available to everyone as soon as it is stored in the data repository.
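To make the contrast concrete, here is a hedged ELT sketch (table and column names are invented, and SQLite again stands in for a warehouse): the raw rows are loaded untouched, and the transformation happens later, inside the repository, with SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load first: raw rows go into the repository exactly as captured,
# messy whitespace and text-typed numbers included.
conn.execute("CREATE TABLE raw_signups (name TEXT, signups TEXT)")
conn.executemany(
    "INSERT INTO raw_signups VALUES (?, ?)",
    [("acme", "10"), ("globex", " 25")],
)

# Transform later, inside the repository: anyone with SQL access can
# clean the data on demand instead of waiting on an engineering team.
conn.execute("""
    CREATE VIEW clean_signups AS
    SELECT UPPER(TRIM(name)) AS company,
           CAST(TRIM(signups) AS INTEGER) AS count
    FROM raw_signups
""")

rows = conn.execute("SELECT company, count FROM clean_signups").fetchall()
print(rows)  # [('ACME', 10), ('GLOBEX', 25)]
```

Because the raw table is preserved, the transformation can be rewritten at any time without re-extracting the data, which is the practical appeal of ELT.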


See also

Articles you might be interested in

  1. Data science for startups: Data pipelines
  2. ETL vs. ELT: Differences explained