What is Hadoop?

Apache Hadoop is a set of open-source software programs that lets you process large data sets across clusters of computers. It is one of the most widely used frameworks for managing big data processing and storage.

What does Hadoop do?

Along with other tools and applications in its ecosystem, Hadoop helps collect, store, process, analyze and manage big data.

Hadoop processes massive amounts of data using distributed computing: a network of computers that don't share memory or disks. Because data is replicated across several machines rather than kept in one central repository, the failure of a single machine doesn't result in data loss.

Doug Cutting, one of the original creators, named the framework 'Hadoop' after his son's beloved yellow toy elephant.🐘

What are the different modules of Hadoop?

Hadoop is made up of four main modules—HDFS, MapReduce, Hadoop Common and YARN. The primary components are HDFS and MapReduce.

1. HDFS (Hadoop Distributed File System)

HDFS is the primary storage system used by Hadoop applications. It can store massive amounts of structured and unstructured data across many connected machines.

In HDFS, the files are broken down into smaller blocks and distributed across a cluster—a system of linked computers. Each block is replicated several times to avoid data loss due to a hardware malfunction.

The computers in the cluster are called nodes. The nodes can be NameNodes (that manage the file system metadata) or DataNodes (that store the actual data).
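As an illustration, the sketch below uses Hadoop's Java FileSystem API to write a small file into HDFS and then print its replication factor and the nodes holding its blocks. The hdfs://namenode:9000 URI and the /tmp/hello.txt path are placeholders for your own cluster address and file path.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode address -- replace with your cluster's fs.defaultFS
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Write a small file; HDFS splits it into blocks and replicates each block
        Path file = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");
        }

        // Inspect how the file is stored: replication factor and block placement
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block hosts: " + String.join(", ", block.getHosts()));
        }

        fs.close();
    }
}
```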

2. MapReduce

MapReduce, a programming model originally developed at Google, is what you use to process and analyze data in Hadoop. A job runs in two stages:

  1. Map: Read the input data and transform each record into intermediate key-value pairs
  2. Reduce: Aggregate the values for each key and produce the final output

To write MapReduce jobs, you need to be well-versed in Java.
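The canonical illustration is counting words in a text corpus. The sketch below follows the standard Hadoop WordCount example: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums those counts per word. The input and output paths are placeholders passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: turn each line of input into (word, 1) pairs
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output locations in HDFS, passed as command-line arguments
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this into a JAR and submit it to the cluster with something like hadoop jar wordcount.jar WordCount /input /output, where /input and /output are HDFS paths of your choosing.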

3. Hadoop Common

Hadoop Common contains the shared Java libraries and utilities that the other Hadoop modules depend on, including the code used to read data stored in the Hadoop file system.
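For example, the Configuration class that Hadoop programs use to load cluster settings from core-site.xml lives in Hadoop Common. A minimal sketch (run without a cluster, it simply falls back to Hadoop's built-in defaults):

```java
import org.apache.hadoop.conf.Configuration;

public class CommonConfigExample {
    public static void main(String[] args) {
        // Configuration is part of Hadoop Common; it loads core-default.xml
        // and core-site.xml from the classpath
        Configuration conf = new Configuration();

        // fs.defaultFS tells Hadoop components where the file system lives
        System.out.println("Default file system: " + conf.get("fs.defaultFS"));

        // Another core property, with its built-in default if nothing overrides it
        System.out.println("Temp directory: " + conf.get("hadoop.tmp.dir"));
    }
}
```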

4. Apache YARN (Yet Another Resource Negotiator)

Apache YARN is a resource manager. In Hadoop, it is the central platform responsible for managing computing resources across Hadoop clusters and scheduling jobs.
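As a small illustration, the sketch below uses the YarnClient API to ask the ResourceManager how many worker nodes are available and which applications are running. It assumes the YARN client libraries are on the classpath and that a yarn-site.xml pointing at your cluster's ResourceManager can be found there.

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnStatus {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath to locate the ResourceManager
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient client = YarnClient.createYarnClient();
        client.init(conf);
        client.start();

        // How many NodeManagers (worker nodes) are offering resources
        System.out.println("Node managers: "
                + client.getYarnClusterMetrics().getNumNodeManagers());

        // Applications that YARN is currently tracking
        List<ApplicationReport> apps = client.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getName() + " -> " + app.getYarnApplicationState());
        }

        client.stop();
    }
}
```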

These four core modules of Hadoop are supported by a larger ecosystem of big data technologies and products such as HBase, Pig, ZooKeeper and Hive.

The Hadoop ecosystem. Image courtesy: Towards Data Science

