What is Apache Flume?

Apache Flume is an open-source, distributed service that collects log data from many sources and moves it to a centralized destination for aggregation and analysis. It is highly fault-tolerant.

Flume supports data collection in batch mode (processing data in chunks) as well as in streaming mode (processing data in real time).

Why should you collect logs? 🤔

An application can be made up of thousands of services, all running on multiple servers and each producing huge volumes of log data.

To analyze the logs for the entire application, you need to collect them from the various servers the application is running on and bring them to a single place.

That's where Flume comes in. Flume collects log data from a variety of sources and delivers it into HDFS (the Hadoop Distributed File System) to be aggregated and analyzed. 🏘
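To make this concrete, a Flume agent is wired together from a source, a channel, and a sink in a properties file. Below is a minimal sketch of such a configuration; the agent name `a1`, the log path, and the NameNode address are hypothetical and would need to match your environment.

```properties
# Hypothetical agent "a1" with one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail an application log file (path is illustrative)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/myapp/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events into HDFS, partitioned by date
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

You would then start the agent with something like `flume-ng agent --conf conf --conf-file flume.conf --name a1`. The channel is what gives Flume its fault tolerance: events sit in the channel until the sink confirms delivery (a durable file channel can be used instead of the memory channel shown here).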

What are some of the other common use cases of Flume?

Webpages generate a lot of events and clickstream data, which must be analyzed to understand usage patterns. Flume can bring such data into a single place for further analysis.


See also

Articles you might be interested in

  1. Apache Flume and data pipelines
  2. Sqoop vs Flume: Battle of the Hadoop ETL tools