Real-Time Stream Processing Using Flink and Kafka with Scala and Zeppelin (Part 1): Installations
A comprehensive and detailed walkthrough of the requirements and installation procedure for each service, providing an overall understanding of each one.
INTRODUCTION
In this installation guide, we will walk step by step through installing a suite of essential services for stream data processing on an Ubuntu 22.04 system. We will cover the installation of six services: Java 1.8.0_382, Scala 2.11.12, sbt 1.9.3, Apache Flink 1.14.3, Apache Kafka 3.5.0, and Apache Zeppelin. Together, these services form the backbone of modern data engineering and stream processing solutions, enabling us to handle real-time data streams efficiently and effectively. By the end of this tutorial, we will have a well-prepared environment ready to tackle a wide range of data stream processing challenges. In the second part of this article, a real data processing case is implemented on top of this setup. Let's get started with the installation procedures to ensure your system is equipped with the necessary tools for your stream processing projects.
DEFINITIONS
- Java
Java, initially introduced by Sun Microsystems in 1995, serves as both a programming language and a computing platform. Over the years, it has transformed from its modest origins into a dominant force in today’s digital landscape, serving as a dependable foundation for numerous services and applications. Even as we move towards the future, Java remains a crucial component for the development of innovative products and digital services.
- Scala
Scala is a contemporary programming language with a multi-paradigm approach, crafted to succinctly convey familiar programming concepts in an elegant and type-safe manner. It combines elements from both object-oriented and functional programming languages.
- Simple Build Tool (SBT)
SBT is a build tool used for projects in both Scala and Java, and it was the preferred tool of 93.6% of Scala developers in 2019. One notable Scala-specific feature it offers is the ability to cross-compile a project across various Scala versions.
- Apache Flink
Apache Flink is a framework and distributed processing engine tailored for stateful computations over unbounded and bounded data streams. Flink's architecture is built to operate in various cluster environments, enabling high-speed, in-memory computations at any scale.
- Apache Kafka
Apache Kafka is an open-source event streaming platform that is widely adopted by numerous organizations for high-performance data pipelines, streaming analytics, data integration, and crucial applications.
- Apache Zeppelin
Apache Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R, and more.
INSTALLATIONS
I. Java
The installation process for Java involves the following sequential steps:
1. Go to the home directory
cd
2. Update the package lists
sudo apt-get update
3. Install Java
sudo apt install openjdk-8-jdk
4. Prepare the environment variables for Java
nano .bashrc
# Put the following lines at the end of the file; you can jump to the end of the file with the (Alt+/) shortcut.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
5. Apply the changes to the .bashrc file
source .bashrc
6. Ensure that the JAVA_HOME variable is correctly configured
echo $JAVA_HOME
7. Ensure that you have the correct Java version
java -version
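To double-check the result of java -version programmatically, the major version can be parsed out of the output. The sketch below is a small, self-contained helper of our own (the function name java_major_version is not part of any tool); it handles both the legacy 1.8.0_382 scheme and the modern 11.0.20 scheme:

```shell
#!/bin/sh
# java_major_version: extract the major Java version from a `java -version`-style
# line, e.g. 'openjdk version "1.8.0_382"' -> 8, 'openjdk version "11.0.20"' -> 11.
java_major_version() {
  # Pull out the quoted version string.
  ver=$(printf '%s\n' "$1" | sed -n 's/.*"\([^"]*\)".*/\1/p')
  case "$ver" in
    1.*) printf '%s\n' "$ver" | cut -d. -f2 ;;  # legacy scheme: 1.<major>.x
    *)   printf '%s\n' "$ver" | cut -d. -f1 ;;  # modern scheme: <major>.x.y
  esac
}

# Against the live JVM (note: java -version writes to stderr):
# java_major_version "$(java -version 2>&1 | head -n 1)"
java_major_version 'openjdk version "1.8.0_382"'   # prints 8
```

A check like this is handy later, because Flink 1.14 and the Scala 2.11 toolchain expect Java 8.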
II. Scala
The sequential steps for installing Scala are as follows:
1. Go to the home directory
cd
2. Update the package lists
sudo apt-get update
3. Install Scala
sudo apt-get install scala
4. Ensure that you have the correct Scala version
scala -version
III. SBT
The installation of SBT can be completed through the following steps:
1. Go to the home directory
cd
2. Update the package lists
sudo apt-get update
3. Install SBT. Note that sbt may not be available in Ubuntu's default repositories; if the command below fails, add the official sbt repository first (see the setup instructions at scala-sbt.org) and run sudo apt-get update again
sudo apt-get install sbt
4. Ensure that you have the correct SBT version
sbt --version
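With sbt installed, a minimal project layout can be scaffolded to confirm the toolchain works end to end. The sketch below writes the two files every sbt project needs (the project name stream-demo is our own choice, matching nothing in particular); running sbt compile inside the directory afterwards exercises sbt, Scala, and Java together:

```shell
#!/bin/sh
# Scaffold a minimal sbt project pinned to the Scala version installed above.
mkdir -p stream-demo/src/main/scala

cat > stream-demo/build.sbt <<'EOF'
name := "stream-demo"
scalaVersion := "2.11.12"
EOF

cat > stream-demo/src/main/scala/Main.scala <<'EOF'
object Main extends App {
  println("hello, stream-demo")
}
EOF

ls stream-demo
```

After scaffolding, cd stream-demo && sbt run should download the Scala 2.11.12 compiler and print the greeting.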
IV. Apache Flink
In this tutorial, Apache Flink 1.14.3 is installed, since more recent versions have compatibility issues when run on Apache Zeppelin. To guide you through the process, here are the sequential steps for its installation:
1. Go to the home directory
cd
2. Download the Apache Flink binary archive from the official Apache archive
wget https://archive.apache.org/dist/flink/flink-1.14.3/flink-1.14.3-bin-scala_2.11.tgz
3. Extract the Flink archive and rename the extracted directory
tar xzf flink-1.14.3-bin-scala_2.11.tgz
mv flink-1.14.3/ flink14
4. Modify the flink-conf.yaml file so that the Flink web interface is reachable
nano flink14/conf/flink-conf.yaml
# Change or activate the following lines
rest.port: 8081
rest.address: localhost
rest.bind-address: 0.0.0.0
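A typo in one of these keys silently breaks access to the web interface, so it can be worth verifying them after editing. The helper below is a small sketch of our own, not a Flink tool; it reports which of the three REST keys are present in a given flink-conf.yaml:

```shell
#!/bin/sh
# check_rest_conf: report which of the three REST keys are present in a
# flink-conf.yaml-style file; prints "missing: <key>" for absent ones.
check_rest_conf() {
  for key in rest.port rest.address rest.bind-address; do
    if grep -q "^$key:" "$1" 2>/dev/null; then
      echo "ok: $key"
    else
      echo "missing: $key"
    fi
  done
}

# Usage: check_rest_conf flink14/conf/flink-conf.yaml
```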
5. Go to the home directory and obtain <your home path>, which will be used in the next step
cd && pwd
6. Go to the home directory and add Flink path to bashrc file
cd && nano .bashrc
# Put the following lines at the end of the file; you can jump to the end of the file with the (Alt+/) shortcut. Note that <your home path> must be replaced with the result obtained in step 5.
export FLINK_HOME=<your home path>/flink14
export PATH=$PATH:$FLINK_HOME/bin
7. Apply the changes to the .bashrc file
source .bashrc
8. Ensure that the FLINK_HOME variable is correctly configured
echo $FLINK_HOME
9. Ensure that you have the correct Apache Flink version
flink --version
10. Run Apache Flink
flink14/bin/start-cluster.sh
11. Access Apache Flink Interface in your web browser at (localhost:8081)
12. Establish connectivity between Apache Flink and Apache Kafka by downloading the following dependency jars into Flink's lib directory
cd
cd flink14/lib
wget https://repo1.maven.org/maven2/org/apache/flink/flink-core/1.14.3/flink-core-1.14.3.jar
wget https://repo1.maven.org/maven2/org/apache/flink/flink-connector-kafka_2.11/1.14.3/flink-connector-kafka_2.11-1.14.3.jar
wget https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/3.2.0/kafka-clients-3.2.0.jar
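A missing or half-downloaded jar in flink14/lib is a common source of "class not found" errors at runtime, so a quick existence check after the downloads can save debugging time later. This is a small helper of our own devising, not part of Flink:

```shell
#!/bin/sh
# check_jars: verify that each named jar exists (and is non-empty) in a directory.
check_jars() {
  dir=$1; shift
  for jar in "$@"; do
    if [ -s "$dir/$jar" ]; then
      echo "ok: $jar"
    else
      echo "missing or empty: $jar"
    fi
  done
}

# Usage:
# check_jars flink14/lib \
#   flink-core-1.14.3.jar \
#   flink-connector-kafka_2.11-1.14.3.jar \
#   kafka-clients-3.2.0.jar
```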
V. Apache Kafka
The sequential steps for installing Apache Kafka are as follows:
1. Go to the home directory
cd
2. Download the Apache Kafka binary archive from its official website
wget https://downloads.apache.org/kafka/3.5.0/kafka_2.12-3.5.0.tgz
3. Extract the archive and move it to /opt (root privileges are required)
tar xzf kafka_2.12-3.5.0.tgz
sudo mv kafka_2.12-3.5.0 /opt/kafka
4. Create the systemd unit file for the Zookeeper service (root privileges are required)
sudo nano /etc/systemd/system/zookeeper.service
[Unit]
Description=Apache Zookeeper service
Documentation=http://zookeeper.apache.org
Requires=network.target remote-fs.target
After=network.target remote-fs.target
[Service]
Type=simple
ExecStart=/opt/kafka/bin/zookeeper-server-start.sh /opt/kafka/config/zookeeper.properties
ExecStop=/opt/kafka/bin/zookeeper-server-stop.sh
Restart=on-abnormal
[Install]
WantedBy=multi-user.target
5. Reload the systemd daemon so the new unit takes effect
sudo systemctl daemon-reload
6. Create the systemd unit file for the Kafka service
sudo nano /etc/systemd/system/kafka.service
[Unit]
Description=Apache Kafka Service
Documentation=http://kafka.apache.org/documentation.html
Requires=zookeeper.service
[Service]
Type=simple
Environment="JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64"
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
ExecStop=/opt/kafka/bin/kafka-server-stop.sh
[Install]
WantedBy=multi-user.target
7. Reload the daemon again
sudo systemctl daemon-reload
8. Start the Zookeeper service
sudo systemctl start zookeeper
9. Ensure that the Zookeeper service is running
systemctl status zookeeper
10. Start the Kafka service
sudo systemctl start kafka
11. Ensure that the Apache Kafka service is running
systemctl status kafka
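Once both services report active, a quick smoke test is to create a topic and list it back. The function below is a sketch of ours that assumes the paths and broker address used in this guide (/opt/kafka and localhost:9092; the topic name smoke-test is arbitrary), and it skips gracefully on machines where Kafka is not installed:

```shell
#!/bin/sh
# kafka_smoke_test: create a test topic and list topics on a local broker.
# Skips with a message when the Kafka CLI tools are not found.
kafka_smoke_test() {
  bin=${1:-/opt/kafka/bin}
  broker=${2:-localhost:9092}
  if [ -x "$bin/kafka-topics.sh" ]; then
    "$bin/kafka-topics.sh" --create --if-not-exists --topic smoke-test \
      --bootstrap-server "$broker" --partitions 1 --replication-factor 1
    "$bin/kafka-topics.sh" --list --bootstrap-server "$broker"
  else
    echo "kafka-topics.sh not found under $bin; skipping"
  fi
}

# Usage: kafka_smoke_test /opt/kafka/bin localhost:9092
```

If the topic shows up in the list output, the broker and its Zookeeper dependency are both healthy.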
VI. Apache Zeppelin
The sequential steps for installing Apache Zeppelin are as follows:
1. Go to the home directory
cd
2. Download the Apache Zeppelin binary archive from its official website
wget https://dlcdn.apache.org/zeppelin/zeppelin-0.10.1/zeppelin-0.10.1-bin-all.tgz
3. Extract the Zeppelin archive and rename the extracted directory
tar xzf zeppelin-0.10.1-bin-all.tgz
mv zeppelin-0.10.1-bin-all/ zeppelin
4. Modify the Apache Zeppelin configuration file (zeppelin-site.xml) to enable access to its interface. Note that the Zeppelin port is changed to 8082 here, as port 8080 may already be in use by Jupyter Notebook.
cd zeppelin/conf
cp zeppelin-site.xml.template zeppelin-site.xml
nano zeppelin-site.xml
Modify the following properties so that Zeppelin listens on all interfaces on port 8082:
<property>
  <name>zeppelin.server.addr</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>zeppelin.server.port</name>
  <value>8082</value>
</property>
5. Run Apache Zeppelin
cd ..
cd bin
./zeppelin-daemon.sh start
6. Access Apache Zeppelin Interface in your web browser at (localhost:8082)
7. Add the (FLINK_HOME) variable to the Apache Zeppelin Flink interpreter settings so that Zeppelin can locate your Flink installation. You can obtain its value using the following command
echo $FLINK_HOME
8. Restart Apache Zeppelin
cd
cd zeppelin/bin
./zeppelin-daemon.sh restart
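If the interface does not load after the restart, a command-line check helps distinguish a down service from a browser or network issue. The helper below is our own sketch built on curl; it only assumes the port chosen above (8082):

```shell
#!/bin/sh
# check_http: print "up: <url>" if the URL answers with a successful HTTP
# status, "down: <url>" otherwise; reports when curl is unavailable.
check_http() {
  url=$1
  if command -v curl >/dev/null 2>&1; then
    if curl -fsS -o /dev/null --max-time 5 "$url" 2>/dev/null; then
      echo "up: $url"
    else
      echo "down: $url"
    fi
  else
    echo "curl not available; cannot check $url"
  fi
}

# Usage: check_http http://localhost:8082
```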
It is important to note that Apache Flink must not already be running in the background, because Apache Zeppelin starts it automatically; otherwise, you will encounter a "cannot open Flink interpreter" error. Therefore, stop Apache Flink before using it from Apache Zeppelin. You can stop it using the following commands:
1. Go to the Apache Flink bin directory
cd
cd flink14/bin
2. Stop Apache Flink service
./stop-cluster.sh
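The stop step above can be wrapped in a small guard that is safe to run whether or not a cluster is up. This is a convenience sketch of ours, assuming FLINK_HOME is set as in the Flink section:

```shell
#!/bin/sh
# stop_flink_if_present: run Flink's stop-cluster.sh when it exists, otherwise
# explain why nothing was stopped. Safe to call before launching Zeppelin.
stop_flink_if_present() {
  home=${1:-${FLINK_HOME:-}}
  if [ -n "$home" ] && [ -x "$home/bin/stop-cluster.sh" ]; then
    "$home/bin/stop-cluster.sh"
  else
    echo "no Flink installation found at '${home:-<unset>}'; nothing to stop"
  fi
}

# Usage: stop_flink_if_present "$FLINK_HOME"
```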
CONCLUSION
In conclusion, we have completed the installation of the crucial components for building a robust data processing environment on Ubuntu 22.04: Java, Scala, sbt, Apache Flink, Apache Kafka, and Apache Zeppelin. Connectivity between these services has also been configured. By installing them, we have laid the foundation of a real-time data processing system that can serve a variety of purposes. However, our journey does not end here. In the second part of this article, we will dive into a real data processing case that puts these services to practical use, giving us the opportunity to see how these tools work together in a real-world scenario and to gain valuable insights and skills.
REFERENCES
Apache Flink. (n.d.). What is Apache Flink? — Architecture. Accessed 04.09.2023. Retrieved from https://flink.apache.org/what-is-flink/flink-architecture/
Java. (n.d.). What is Java technology and why do I need it? Accessed 04.09.2023. Retrieved from https://www.java.com/en/download/help/whatis_java.html
Scala. (n.d.). Tour of Scala. Accessed 04.09.2023. Retrieved from https://docs.scala-lang.org/tour/tour-of-scala.html
Kafka. (n.d.). Apache Kafka. Accessed 04.09.2023. Retrieved from https://kafka.apache.org/
sbt. (n.d.). The interactive build tool. Accessed 04.09.2023. Retrieved from https://www.scala-sbt.org/
Zeppelin. (n.d.). Apache Zeppelin. Accessed 04.09.2023. Retrieved from https://zeppelin.apache.org/