Summary

This article is a guide to setting up a local big data cluster with Apache Spark, Hadoop Distributed File System (HDFS), Apache Kafka, and Apache Airflow using Docker Compose.

Abstract

The article "How to create a Big Data Cluster with Spark, HDFS, Kafka and Airflow" explains how to set up an individual cluster for each tool (Apache Spark, HDFS, Apache Kafka, and Apache Airflow) with Docker Compose and how to make the clusters communicate with each other. It gives step-by-step instructions for each cluster, including the custom network and exposed ports that make the clusters reachable from one another, and it includes sample jobs and workflows to test the communication between the components.

Bullet points

The objective of the article is to create individual clusters for each tool using Docker Compose and establish seamless communication between them.
The Spark cluster will be composed of a Spark master and a Spark worker using the Spark image from bitnami.
The Spark cluster will be accessible to the rest of the tools by creating a custom network and exposing the ports 8080 & 7077.
The HDFS cluster will be composed of a namenode and a datanode using the Hadoop image from bde2020.
The HDFS cluster will be accessible to the rest of the tools by creating a custom network and exposing the ports 9870 & 9000.
The Kafka cluster will be composed of a Zookeeper and a Kafka service using the official Confluent image.
The Kafka cluster will be accessible to the rest of the tools by exposing the port 2181 for Zookeeper and the port 9092 for Kafka.
The Airflow cluster will be built using a custom image based on the official Airflow image and will be accessible to the rest of the tools by mapping the container's port 8080 to port 8090 on the host.
The article includes sample jobs and workflows to test the communication between the different components.
The code samples and the complete code can be found on the author's GitHub repository.

How to create a Big Data Cluster with Spark, HDFS, Kafka and Airflow

The big data world is full of tools, and testing how different technologies work together before putting them into production can turn into a nightmare. This article guides you through setting up a local big data cluster with Apache Spark, Hadoop Distributed File System (HDFS), Apache Kafka, and Apache Airflow using Docker Compose.

Setting up Docker Compose Clusters

The objective here is to create individual clusters for each tool — Apache Spark, HDFS, Kafka, and Apache Airflow — using Docker Compose. The goal is not only to set up these clusters independently but also to establish seamless communication between them.

1. Apache Spark Cluster

The Spark cluster will be composed of a Spark master and a Spark worker. To build it, use the Spark image from Bitnami and start a docker-compose with two services, one for the master and one for the worker. To make the Spark cluster accessible to the rest of the tools, create a custom network and expose ports 8080 (web UI) and 7077 (master).
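
The two-service layout described above could be sketched as follows. This is only an illustration, not the author's actual compose file: the image name `bitnami/spark`, the `SPARK_MODE` and `SPARK_MASTER_URL` variables (taken from the Bitnami image's documented settings), the service names, the network name `bigdata`, and the output file name are all assumptions. The sketch relies on the fact that Compose files are YAML and that JSON is valid YAML, so the generated file should be usable with `docker compose -f`.

```python
import json

NETWORK = "bigdata"  # hypothetical name for the shared custom network


def spark_compose(network: str = NETWORK) -> dict:
    """Return a Compose definition with one Spark master and one worker."""
    return {
        "services": {
            "spark-master": {
                "image": "bitnami/spark",  # assumed image name, no tag pinned
                "environment": ["SPARK_MODE=master"],
                # 8080 is the master web UI, 7077 the master endpoint
                "ports": ["8080:8080", "7077:7077"],
                "networks": [network],
            },
            "spark-worker": {
                "image": "bitnami/spark",
                "environment": [
                    "SPARK_MODE=worker",
                    # the worker finds the master through its service name
                    "SPARK_MASTER_URL=spark://spark-master:7077",
                ],
                "depends_on": ["spark-master"],
                "networks": [network],
            },
        },
        # a fixed name lets the other compose files join the same network
        "networks": {network: {"name": network, "driver": "bridge"}},
    }


def write_compose(path: str = "spark-compose.json") -> None:
    """Write the definition to disk as JSON (a subset of YAML)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(spark_compose(), fh, indent=2)
```

Under these assumptions, the compose files of the other clusters would declare the same network with `external: true` instead of creating it again.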

Once you start the cluster, you can access the Spark UI at http://localhost:8080.

2. Hadoop Distributed File System (HDFS)

Setting up an HDFS cluster is a bit more complex. First, define a configuration file named config with the following content:

CORE_CONF_fs_defaultFS=hdfs://namenode:9000
CORE_CONF_fs_default_name=hdfs://namenode:9000
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*
CORE_CONF_io_compression_codecs=org.apache.hadoop.io.compress.SnappyCodec
CORE_CONF_ipc_maximum_data_length=134217728

HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check=false
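
The variable names in this file encode Hadoop property names. As far as I understand the bde2020 image's entrypoint, the prefix selects the target file (CORE_CONF for core-site.xml, HDFS_CONF for hdfs-site.xml), and in the rest of the name a triple underscore becomes a dash, a double underscore becomes a single underscore, and a single underscore becomes a dot. The helper below is a hypothetical re-implementation of that rule for illustration, not code shipped with the image.

```python
# Prefix -> Hadoop configuration file (assumed from the image's entrypoint)
PREFIXES = {
    "CORE_CONF": "core-site.xml",
    "HDFS_CONF": "hdfs-site.xml",
}


def to_hadoop_property(env_name: str) -> tuple:
    """Translate an env var name into (config file, Hadoop property name)."""
    for prefix, target in PREFIXES.items():
        if env_name.startswith(prefix + "_"):
            key = env_name[len(prefix) + 1:]
            key = key.replace("___", "-")   # triple underscore -> dash
            key = key.replace("__", "\0")   # protect double underscores
            key = key.replace("_", ".")     # single underscore -> dot
            key = key.replace("\0", "_")    # double underscore -> underscore
            return target, key
    raise ValueError(f"unknown prefix in {env_name!r}")


print(to_hadoop_property("CORE_CONF_fs_defaultFS"))
print(to_hadoop_property(
    "HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check"
))
```

This is why the last line of the file uses triple underscores: the Hadoop property it sets is dfs.namenode.datanode.registration.ip-hostname-check.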

Then, you can start another docker-compose with two services using the Hadoop image from bde2020. One service will be a namenode and the other a datanode. Expose the namenode's ports 9870 and 9000, and attach both services to the custom network previously created for Spark.

Once you start the cluster, you can access the namenode UI at http://localhost:9870.

3. Apache Kafka Cluster

The Kafka cluster will be composed of a Zookeeper service and a Kafka service, both using the official Confluent images. You will need to expose port 2181 for Zookeeper and port 9092 for Kafka. Finally, attach both services to the custom network and start the cluster. After doing so, containers on that network can send events to Kafka at kafka:9092.
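
To check that events actually reach the broker, a small producer can help. The sketch below assumes the third-party kafka-python package (it is not part of the article's requirements), a hypothetical topic name `test-events`, and that the script runs in a container attached to the custom network, where the hostname kafka resolves.

```python
import json


def encode_event(event: dict) -> bytes:
    """Serialize an event as compact, key-sorted UTF-8 JSON."""
    return json.dumps(event, separators=(",", ":"), sort_keys=True).encode("utf-8")


def send_events(events, topic="test-events", bootstrap_servers="kafka:9092"):
    """Send each event to the topic and wait until all are delivered."""
    # Imported lazily so the module can be loaded without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=encode_event,
    )
    for event in events:
        producer.send(topic, value=event)
    producer.flush()  # block until buffered records have been sent
    producer.close()
```

For example, `send_events([{"id": 1, "status": "ok"}])` would publish one JSON event. Whether localhost:9092 also works from the host depends on the advertised listeners configured for the broker.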

4. Apache Airflow

To start an Airflow cluster, we will build a custom image on top of the official Airflow image. To do so, we first define the following requirements.txt file:

apache-airflow-providers-apache-spark==4.1.1
hdfs==2.7.3

We can then build our custom image, which extends the official Airflow image and installs these requirements.

Finally, we can start our Airflow cluster as a standalone instance, mapping the container's port 8080 to port 8090 on the host so that it does not clash with the Spark UI. Once started, you can access Airflow at http://localhost:8090 and log in with user=admin and password=admin.
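
With all four clusters running, a quick way to confirm that the exposed ports are reachable from the host is a plain TCP check. The sketch below uses only the ports named in this article. The labels and function names are my own, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Host ports exposed by the clusters described in this article
ENDPOINTS = {
    "Spark UI": 8080,
    "Spark master": 7077,
    "HDFS namenode UI": 9870,
    "HDFS namenode RPC": 9000,
    "Zookeeper": 2181,
    "Kafka": 9092,
    "Airflow UI": 8090,
}


def is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_all(host: str = "localhost") -> dict:
    """Map each endpoint label to whether its port accepts connections."""
    return {name: is_open(host, port) for name, port in ENDPOINTS.items()}
```

Calling `check_all()` returns a dictionary such as `{"Spark UI": True, ...}`, which makes it easy to spot a cluster that did not start.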

Sample Jobs and Workflows

Once you have successfully started Spark, HDFS, Kafka and Airflow, you can create a sample ETL and test how the different components communicate with each other. For instance, you could create an Airflow DAG that uploads data to HDFS, runs a Spark job and writes output data to Kafka and HDFS.
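
A minimal sketch of such a DAG is shown below. It is not the author's DAG: the DAG id, the file paths, the sample data, and the Spark application /opt/airflow/jobs/etl_job.py are hypothetical, and the `schedule` argument assumes Airflow 2.4 or later (older versions use `schedule_interval`). The Airflow and hdfs imports are placed inside the functions only so that the helpers can be loaded on a machine without those packages.

```python
from datetime import datetime

NAMENODE_WEB = "http://namenode:9870"  # WebHDFS endpoint inside the Docker network
NAMENODE_RPC = "hdfs://namenode:9000"  # matches fs.defaultFS in the config file


def hdfs_uri(path: str) -> str:
    """Build a full hdfs:// URI for a path on the namenode."""
    return NAMENODE_RPC + "/" + path.lstrip("/")


def upload_to_hdfs(hdfs_path: str, content: str) -> None:
    """Write a small text file to HDFS through WebHDFS."""
    from hdfs import InsecureClient  # the `hdfs` package from requirements.txt

    client = InsecureClient(NAMENODE_WEB, user="root")
    client.write(hdfs_path, data=content, overwrite=True, encoding="utf-8")


def build_etl_dag():
    """Create a two-step DAG: upload to HDFS, then submit a Spark job."""
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.apache.spark.operators.spark_submit import (
        SparkSubmitOperator,
    )

    with DAG(
        dag_id="sample_etl",  # hypothetical DAG id
        start_date=datetime(2024, 1, 1),
        schedule=None,  # trigger manually from the UI
        catchup=False,
    ) as dag:
        upload = PythonOperator(
            task_id="upload_to_hdfs",
            python_callable=upload_to_hdfs,
            op_kwargs={
                "hdfs_path": "/data/input.csv",
                "content": "id,value\n1,10\n2,20\n",
            },
        )
        transform = SparkSubmitOperator(
            task_id="spark_job",
            conn_id="spark_default",
            application="/opt/airflow/jobs/etl_job.py",  # hypothetical Spark job
            # the job is expected to read the input and write to HDFS and Kafka
            application_args=[
                hdfs_uri("/data/input.csv"),
                hdfs_uri("/data/output"),
                "kafka:9092",
            ],
        )
        upload >> transform
    return dag
```

In a real DAG file you would add `dag = build_etl_dag()` at module level so that Airflow discovers it. The `spark_default` connection would need to point at the Spark master (for example spark://spark-master:7077, if that is the master's service name), and SparkSubmitOperator expects a spark-submit binary to be available in the Airflow image.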

You can find all the code, including a complete sample, in the GitHub repo for this article!

Now I’d like to hear from you: let me know how it worked!

Thanks for reading!