What is a data lake?

A data lake stores a collection of various raw data sets from multiple internal and external data sources. The data can be unstructured, semi-structured or structured. (Having a hard time visualizing it 🤯? Here are some analogies to help you out.)

James Dixon, who coined the term "data lake", puts it this way:

If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

Here's an even simpler analogy from Alex Gorelik, author of the book "The Enterprise Big Data Lake":

A data lake is sort of like a piggy bank. You often don’t know what you are saving the data for, but you want it in case you need it one day.

A data lake is like a piggy bank. Image courtesy: Alex Gorelik

Data lakes are commonly built on cloud object storage services such as Amazon S3 and Azure Blob Storage.
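To make the "store it as it is" idea concrete, here is a minimal sketch of landing raw files in a lake. It assumes a local temporary folder as a stand-in for an object store, and the `land_raw` helper, the `source=`/`date=` folder layout and the sample payloads are all hypothetical names chosen for illustration, not a standard.

```python
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, day: str, filename: str, payload: bytes) -> Path:
    """Store a payload byte-for-byte under raw/source=<source>/date=<day>/.

    Hypothetical helper: nothing is parsed, cleaned or reshaped on the way in.
    """
    target_dir = lake_root / "raw" / f"source={source}" / f"date={day}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)
    return target


# A temporary local folder plays the role of the lake (assumption for this sketch).
lake_root = Path(tempfile.mkdtemp(prefix="lake_"))

# Structured, semi-structured and unstructured data all land the same way.
csv_path = land_raw(lake_root, "crm", "2024-01-05", "customers.csv", b"id,name\n1,Ana\n")
json_path = land_raw(lake_root, "web", "2024-01-05", "clicks.json", b'{"page": "/home", "user": 1}\n')
img_path = land_raw(lake_root, "support", "2024-01-05", "ticket.png", b"\x89PNG\r\n\x1a\n")

# List everything in the lake, relative to its root.
landed = sorted(p.relative_to(lake_root).as_posix() for p in lake_root.rglob("*") if p.is_file())
for path in landed:
    print(path)
```

Grouping files by source and date is only one possible convention. The point is that each file is written unchanged, whatever its format.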

Who uses data lakes?

Anyone can work with a data lake.

However, using a data lake requires familiarity with data processing and analysis techniques. That's why data scientists and data engineers are the ones who mostly work with data lakes.

Isn't a data lake the same as a data warehouse?

It's a common misconception. We get it. But a data lake is not data warehouse 2.0. The two are completely different repositories, built for different purposes.

Comparing data warehouses and data lakes. Image courtesy: Thorn Technologies

Let's bust those misconceptions by looking at some of the main differences between a data lake and a data warehouse.

1. The way you store data is different

Before storing data in a data warehouse, you need to model it, that is, give it a structure. This approach is called schema-on-write.

In a data lake, you can store raw data just the way it is and model it later, when you need it for analysis. This approach is called schema-on-read.
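The difference between the two approaches can be sketched in a few lines of Python. This is a toy illustration, not how any particular warehouse or lake product works: the in-memory lists standing in for storage, the `WAREHOUSE_SCHEMA` mapping, the function names and the sample events are all assumptions made up for the example.

```python
import json

# Hypothetical raw events, as they might arrive from different sources.
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.99", "ts": "2024-01-05"}',
    '{"user": "bo", "amount": "5", "coupon": "NEW10"}',
    '{"user": "cy"}',
]

# Schema-on-write: a fixed structure is enforced before anything is stored.
WAREHOUSE_SCHEMA = {"user": str, "amount": float, "ts": str}


def write_to_warehouse(table, raw_line):
    """Parse, validate and cast a record; reject it if a column is missing."""
    record = json.loads(raw_line)
    row = {}
    for column, cast in WAREHOUSE_SCHEMA.items():
        if column not in record:
            raise ValueError(f"missing column: {column}")
        row[column] = cast(record[column])
    table.append(row)


def write_to_lake(lake, raw_line):
    """Schema-on-read, step 1: store the raw line untouched."""
    lake.append(raw_line)


def read_amounts(lake):
    """Schema-on-read, step 2: apply a structure only when a question is asked."""
    amounts = []
    for raw_line in lake:
        record = json.loads(raw_line)
        if "amount" in record:
            amounts.append(float(record["amount"]))
    return amounts


warehouse, lake, rejected = [], [], []
for line in RAW_EVENTS:
    write_to_lake(lake, line)
    try:
        write_to_warehouse(warehouse, line)
    except ValueError:
        rejected.append(line)

print(len(warehouse), "row(s) in the warehouse,", len(rejected), "rejected")
print(len(lake), "raw line(s) in the lake")
print("amounts read from the lake:", read_amounts(lake))
```

In this sketch the warehouse keeps only the one event that fits its schema, while the lake keeps all three. A later question, say about the `coupon` field, can still be answered from the lake by writing a new reader, with no reloading of data.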

2. Data lakes are flexible; data warehouses aren't

A lot of effort and decision-making is involved before storing data in a warehouse. This includes defining the business questions to be answered, then processing the raw data and transforming it into a structured format. That's why reconfiguring a data warehouse is tedious and time-consuming.

On the other hand, data lakes can be configured and reconfigured easily since they don't have a predefined structure.

3. Scaling data warehouses is challenging

Traditionally, organizations invested heavily in data warehouses to process and store data that answer specific business questions. However, scaling data warehouses is expensive, and changing the structure of the data stored is cumbersome.

Data lakes offer a solution to the challenges posed by data warehouses since they're cheap (for storing massive volumes of data), highly scalable and flexible.

4. Data warehouses are more secure than data lakes

Data warehouses have been around for a while, and as a result they have mature controls for data security and integrity. Data lakes are still evolving, and their security tooling is generally not as mature yet.

Also, since data lakes store all kinds of data in a single repository, they might make your data more vulnerable.

Why should you store data in a data lake?

For organizations that work with big data, data lakes offer a low-cost storage alternative that overcomes the challenges presented by data warehouses.

Data warehouses store historical data: what happened in the past and what conclusions you can draw from it. With data lakes, you can use data to explain not just what happened in the past, but also what’s likely to occur in the future.

Moreover, since real-time data can be directly streamed into data lakes, you can also do real-time analytics.

