How to create a Big Data Cluster with Spark, HDFS, Kafka and Airflow
The big data ecosystem has many tools, and testing how different technologies work together before putting them into production can quickly turn into a nightmare. This article guides readers through setting up a local big data cluster with popular tools such as Apache Spark, Hadoop Distributed File System (HDFS), Apache Kafka, and Apache Airflow, using Docker Compose.
Setting up Docker Compose Clusters
The objective here is to create an individual cluster for each tool (Apache Spark, HDFS, Kafka, and Apache Airflow) using Docker Compose. The goal is not only to set up these clusters independently but also to let them communicate with each other.
1. Apache Spark Cluster
The Spark cluster will be composed of a Spark master and a Spark worker. To build it, use the Spark image from Bitnami and define a Docker Compose file with two services, one for the master and one for the worker. To make the Spark cluster accessible to the rest of the tools, create a custom network and expose ports 8080 (the master web UI) and 7077 (the port on which the master accepts connections from workers and applications).
Once you start the cluster, you can access the Spark UI at http://localhost:8080.
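The setup described above could look roughly like the following Compose file. This is a minimal sketch, not necessarily the exact file used for this article: the service names `spark-master` and `spark-worker` and the network name `bigdata-net` are assumptions chosen for illustration, and the image is left untagged. `SPARK_MODE` and `SPARK_MASTER_URL` are the environment variables the Bitnami image documents for selecting the role and pointing a worker at its master; check them against the image version you pull.

```yaml
# docker-compose.yml: illustrative sketch with assumed service and network names
services:
  spark-master:
    image: bitnami/spark        # pin a specific tag in practice
    environment:
      - SPARK_MODE=master
    ports:
      - "8080:8080"             # master web UI
      - "7077:7077"             # port workers and applications connect to
    networks:
      - bigdata-net

  spark-worker:
    image: bitnami/spark
    environment:
      - SPARK_MODE=worker
      # The hostname is the master's service name on the shared network
      - SPARK_MASTER_URL=spark://spark-master:7077
    depends_on:
      - spark-master
    networks:
      - bigdata-net

networks:
  bigdata-net:
    name: bigdata-net           # fixed name so other Compose projects can join it
    driver: bridge
```

Running `docker compose up -d` in the directory containing this file should start both containers. Giving the network a fixed name is one way to let the other clusters attach to it later by declaring the same network as external in their own Compose files.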





