What is Spark?

Apache Spark is an open-source, distributed computing framework for processing and analyzing big data.

From loading, transforming and querying data to machine learning and stream processing, it's a one-stop shop for working with big data. 🤩

What are the different components of Spark?

Apache Spark is built around Spark Core, the data processing engine at the heart of the entire framework. On top of the core sit libraries for SQL, machine learning, graph processing and stream processing.

The Apache Spark ecosystem:

1. Spark Core

Spark Core is the data processing engine of Apache Spark—the basis of the entire Spark project. Using Spark Core, you can load data from various sources and formats, and then transform and process the data.

It has a large set of libraries and APIs and supports several programming languages (Java, Scala, Python, R and SQL). That language flexibility means everyone from developers and data engineers to data scientists can work with it easily.

Did you know that the almost 50 lines of MapReduce code it takes to count words in a file can be reduced to just a few lines of Spark code? 😱
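
Here's a minimal sketch of that word count in Scala, as you might run it in the spark-shell (where a SparkSession named spark is predefined); the input path is a placeholder:

```scala
// Load a text file, split each line into words and count the words
val counts = spark.sparkContext
  .textFile("input.txt")        // placeholder path
  .flatMap(_.split("\\s+"))     // split lines on whitespace
  .map(word => (word, 1))       // pair each word with a count of 1
  .reduceByKey(_ + _)           // sum the counts per word

counts.take(10).foreach(println)
```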

2. Spark SQL

The humans of data, from data analysts and scientists to business users, depend on SQL to explore and analyze data. With Spark SQL, you can query and process structured data using either SQL or HiveQL.

Spark SQL lets you mix SQL queries with programmatic transformations and uses the DataFrame approach (much like Pandas) to work with structured data.
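
As a quick sketch (again in the spark-shell, with a hypothetical people.json file), the same question can be asked in plain SQL or through the DataFrame API:

```scala
// Load structured data into a DataFrame; people.json is a placeholder
val people = spark.read.json("people.json")

// Register a temporary view so the data can be queried with plain SQL
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

// The equivalent query through the DataFrame API
people.filter("age > 30").select("name", "age").show()
```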

3. Spark Streaming

You can use Spark Streaming to process real-time streaming data and build continuous applications, that is, end-to-end applications that interact with data as it arrives.
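
In recent Spark versions this is typically done with Structured Streaming, which treats a stream as an unbounded table. Here's a sketch that keeps a running word count of text arriving on a local socket (the host and port are placeholders):

```scala
import spark.implicits._

// Treat lines of text arriving on a socket as an unbounded table
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")  // placeholder host
  .option("port", "9999")       // placeholder port
  .load()

// Split the lines into words and keep a running count per word
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Print the updated counts to the console as new data arrives
counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```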

4. MLlib

Spark also includes MLlib, a library of machine learning algorithms covering classification, regression, cluster analysis, feature extraction and more. Beyond MLlib, Spark also works with several third-party machine learning libraries.
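
As an illustration, here's a sketch that clusters a tiny, made-up set of feature vectors with MLlib's k-means (all the data is invented for the example):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

// A tiny, made-up dataset of feature vectors
val data = spark.createDataFrame(Seq(
  (1, Vectors.dense(0.0, 0.0)),
  (2, Vectors.dense(0.1, 0.1)),
  (3, Vectors.dense(9.0, 9.0)),
  (4, Vectors.dense(9.1, 9.1))
)).toDF("id", "features")

// Cluster the points into two groups
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(data)

// Assign each point to a cluster
model.transform(data).show()
```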

5. GraphX

Spark GraphX is a distributed graph processing framework that helps build and transform graph data structures at scale.
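
A small sketch of what that looks like: building a graph from made-up vertex and edge RDDs and ranking the vertices with PageRank (the names and values are purely illustrative):

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Made-up vertices (id, name) and edges (source, destination, relationship)
val vertices: RDD[(Long, String)] = spark.sparkContext.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Carol")
))
val edges: RDD[Edge[String]] = spark.sparkContext.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")
))

// Build the graph and rank vertices by influence
val graph = Graph(vertices, edges)
val ranks = graph.pageRank(0.001).vertices
ranks.collect().foreach(println)
```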

How is Spark different from Hadoop?

Unlike Hadoop, Spark doesn't store data permanently because it doesn't have its own file system. Instead, you can pair Spark with several storage systems, such as cloud object stores (Amazon S3), distributed file systems (HDFS) and NoSQL databases (Cassandra).
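
For example, the same read API works across these backends; only the path scheme changes (the paths below are placeholders, and the s3a scheme additionally requires the hadoop-aws connector and credentials):

```scala
// Local file system
val local = spark.read.textFile("file:///data/events.txt")

// HDFS (namenode host and port are placeholders)
val hdfs = spark.read.textFile("hdfs://namenode:8020/data/events.txt")

// Amazon S3 through the s3a connector (bucket name is a placeholder)
val s3 = spark.read.textFile("s3a://my-bucket/data/events.txt")
```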

Compared to Hadoop's MapReduce, Spark Core is simpler to program and typically faster, largely because it keeps intermediate results in memory instead of writing them to disk between stages, and it supports a wide variety of programming languages (Java, Python, R, Scala).

Who uses Spark?

Spark is ideal for almost all big data applications. Its applications include machine learning, real-time analytics and graph processing, and it's used by organizations such as Amazon, IBM, Netflix, eBay and Yahoo!

Think we're missing something? 🧐 Help us update this article by sending us your suggestions here. 🙏

See also

Articles you might be interested in

  1. Spark: The Definitive Guide by Bill Chambers and Matei Zaharia
  2. What is Apache Spark?
  3. Spark or Hadoop — Which is the best big data framework?