What is Apache Impala?

Apache Impala is an open-source SQL query engine for processing large volumes of data stored in Hadoop clusters (aka where Hadoop stores its data—HDFS, HBase or even an Amazon S3 bucket).

Impala's MPP architecture allows fast computations against data stored in a Hadoop cluster using SQL. It combines the experience of a tradition analytics database with the scalability of Apache Hadoop. (It means you can analyze lots of data quickly, using our friendly neighborhood SQL! 😍🚀)

What does Impala add to the Apache Hadoop ecosystem?

Impala can communicate directly with storage engines using Apache HBase and HDFS (Hadoop's file system). It uses the same file formats (like Parquet, Avro, RCFile) that are used by Hadoop. As a result, Impala avoids having to transport data into specialized systems/formats for analytics and seamlessly integrates with the Hadoop cluster.

Impala also uses the same metadata engines, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other Hadoop software. All data is immediately query-able, with no delays for ETL.

See also

  1. Apache HBase
  2. Apache Hadoop

Articles you might be interested in

  1. Apache Impala overview
  2. implyr: R interface for Apache Impala