Real-Time Stream Processing Using Flink and Kafka with Scala and Zeppelin (Part 1): Installations
A comprehensive and detailed walkthrough of the requirements and installation procedure for each service, providing an overall understanding of each one.
INTRODUCTION
In this installation guide, we will walk step by step through installing a suite of essential services for stream data processing on an Ubuntu 22.04 system. We will cover the installation of six services: Java 1.8.0_382, Scala 2.11.12, sbt 1.9.3, Apache Flink 1.14.3, Apache Kafka 3.5.0, and Apache Zeppelin. Together, these services form the backbone of modern data engineering and stream processing solutions, enabling us to handle real-time data streams efficiently and effectively. By the end of this tutorial, we will have a well-prepared environment ready to tackle a wide range of data stream processing challenges. In the second part of this article, a real data processing case is implemented on top of this setup. Let's get started with the installation procedures to ensure your system is equipped with the necessary tools for your stream processing projects.
DEFINITIONS
- Java
Java, initially introduced by Sun Microsystems in 1995, serves as both a programming language and a computing platform. Over the years, it has transformed from its modest origins into a dominant force in today’s digital landscape, serving as a dependable foundation for numerous services and applications. Even as we move towards the future, Java remains a crucial component for the development of innovative products and digital services.
- Scala
Scala is a contemporary programming language with a multi-paradigm approach, crafted to succinctly convey familiar programming concepts in an elegant and type-safe manner. It combines elements from both object-oriented and functional programming languages.
- Simple Build Tool (SBT)
SBT is a build tool used for projects in both Scala and Java, and it was the preferred tool of 93.6% of Scala developers in 2019. One notable Scala-specific feature it offers is the ability to cross-compile a project across various Scala versions.
- Apache Flink
Apache Flink is a framework and distributed processing engine tailored for stateful computations over unbounded and bounded data streams. Flink's architecture is built to operate in various cluster environments, enabling high-speed, in-memory computations at any scale.
- Apache Kafka
Apache Kafka is an open-source event streaming platform that is widely adopted by numerous organizations for high-performance data pipelines, streaming analytics, data integration, and crucial applications.
- Apache Zeppelin
Apache Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R, and more.
INSTALLATIONS
I. Java
The installation process for Java involves the following sequential steps:
1. Go to the home directory
cd
2. Update the package lists
sudo apt-get update
3. Install Java
sudo apt install openjdk-8-jdk
4. Prepare the environment variables for Java
nano .bashrc
# Put the following lines at the end of the file; you can jump to the end of the file with the (Alt+/) shortcut.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
5. Apply the changes to the .bashrc file
source .bashrc
6. Ensure that the JAVA_HOME variable is correctly configured
echo $JAVA_HOME
7. Ensure that you have the correct Java version
java -version
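To double-check the result of java -version programmatically, the major version can be parsed out of the output. The sketch below is a small, self-contained helper of our own (the function name java_major_version is not part of any tool); it handles both the legacy 1.8.0_382 scheme and the modern 11.0.20 scheme:

```shell
#!/bin/sh
# java_major_version: extract the major Java version from a `java -version`-style
# line, e.g. 'openjdk version "1.8.0_382"' -> 8, 'openjdk version "11.0.20"' -> 11.
java_major_version() {
  # Pull out the quoted version string.
  ver=$(printf '%s\n' "$1" | sed -n 's/.*"\([^"]*\)".*/\1/p')
  case "$ver" in
    1.*) printf '%s\n' "$ver" | cut -d. -f2 ;;  # legacy scheme: 1.<major>.x
    *)   printf '%s\n' "$ver" | cut -d. -f1 ;;  # modern scheme: <major>.x.y
  esac
}

# Against the live JVM (note: java -version writes to stderr):
# java_major_version "$(java -version 2>&1 | head -n 1)"
java_major_version 'openjdk version "1.8.0_382"'   # prints 8
```

A check like this is handy later, because Flink 1.14 and the Scala 2.11 toolchain expect Java 8.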
II. Scala
The sequential steps for installing Scala are as follows:
1. Go to the home directory
cd
2. Update the package lists
sudo apt-get update
3. Install Scala
sudo apt-get install scala
4. Ensure that you have the correct Scala version
scala -version
III. SBT
The installation of SBT can be completed through the following steps:
1. Go to the home directory
cd
2. Update the package lists
sudo apt-get update
3. Install SBT. Note that sbt may not be available in Ubuntu's default repositories; if the command below fails, add the official sbt repository first (see the setup instructions at scala-sbt.org) and run sudo apt-get update again
sudo apt-get install sbt
4. Ensure that you have the correct SBT version
sbt --version
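With sbt installed, a minimal project layout can be scaffolded to confirm the toolchain works end to end. The sketch below writes the two files every sbt project needs (the project name stream-demo is our own choice, matching nothing in particular); running sbt compile inside the directory afterwards exercises sbt, Scala, and Java together:

```shell
#!/bin/sh
# Scaffold a minimal sbt project pinned to the Scala version installed above.
mkdir -p stream-demo/src/main/scala

cat > stream-demo/build.sbt <<'EOF'
name := "stream-demo"
scalaVersion := "2.11.12"
EOF

cat > stream-demo/src/main/scala/Main.scala <<'EOF'
object Main extends App {
  println("hello, stream-demo")
}
EOF

ls stream-demo
```

After scaffolding, cd stream-demo && sbt run should download the Scala 2.11.12 compiler and print the greeting.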
IV. Apache Flink
In this tutorial, Apache Flink 1.14.3 is installed, since more recent versions have compatibility issues when run on Apache Zeppelin. To guide you through the process, here are the sequential steps for its installation:
1. Go to the home directory
cd
2. Download the Apache Flink binary archive from the official Apache archive
wget https://archive.apache.org/dist/flink/flink-1.14.3/flink-1.14.3-bin-scala_2.11.tgz
3. Extract the Flink archive and rename the extracted directory
tar xzf flink-1.14.3-bin-scala_2.11.tgz
mv flink-1.14.3/ flink14
4. Modify the flink-conf.yaml file so that the Flink web interface is reachable
nano flink14/conf/flink-conf.yaml
# Change or activate the following lines
rest.port: 8081
rest.address: localhost
rest.bind-address: 0.0.0.0
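A typo in one of these keys silently breaks access to the web interface, so it can be worth verifying them after editing. The helper below is a small sketch of our own, not a Flink tool; it reports which of the three REST keys are present in a given flink-conf.yaml:

```shell
#!/bin/sh
# check_rest_conf: report which of the three REST keys are present in a
# flink-conf.yaml-style file; prints "missing: <key>" for absent ones.
check_rest_conf() {
  for key in rest.port rest.address rest.bind-address; do
    if grep -q "^$key:" "$1" 2>/dev/null; then
      echo "ok: $key"
    else
      echo "missing: $key"
    fi
  done
}

# Usage: check_rest_conf flink14/conf/flink-conf.yaml
```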
5. Go to the home directory and obtain <your home path>, which will be used in the next step
cd && pwd
6. Go to the home directory and add Flink path to bashrc file
cd && nano .bashrc
# Put the following lines at the end of the file; you can jump to the end of the file with the (Alt+/) shortcut. Note that <your home path> must be replaced with the result obtained in step 5.
export FLINK_HOME=<your home path>/flink14
export PATH=$PATH:$FLINK_HOME/bin
7. Apply the changes to the .bashrc file
source .bashrc
8. Ensure that the FLINK_HOME variable is correctly configured
echo $FLINK_HOME
9. Ensure that you have the correct Apache Flink version
flink --version
10. Run Apache Flink
flink14/bin/start-cluster.sh
11. Access Apache Flink Interface in your web browser at (localhost:8081)
12. Establish connectivity between Apache Flink and Apache Kafka by downloading the following dependency jars into Flink's lib directory
cd
cd flink14/lib
wget https://repo1.maven.org/maven2/org/apache/flink/flink-core/1.14.3/flink-core-1.14.3.jar
wget https://repo1.maven.org/maven2/org/apache/flink/flink-connector-kafka_2.11/1.14.3/flink-connector-kafka_2.11-1.14.3.jar
wget https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/3.2.0/kafka-clients-3.2.0.jar
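A missing or half-downloaded jar in flink14/lib is a common source of "class not found" errors at runtime, so a quick existence check after the downloads can save debugging time later. This is a small helper of our own devising, not part of Flink:

```shell
#!/bin/sh
# check_jars: verify that each named jar exists (and is non-empty) in a directory.
check_jars() {
  dir=$1; shift
  for jar in "$@"; do
    if [ -s "$dir/$jar" ]; then
      echo "ok: $jar"
    else
      echo "missing or empty: $jar"
    fi
  done
}

# Usage:
# check_jars flink14/lib \
#   flink-core-1.14.3.jar \
#   flink-connector-kafka_2.11-1.14.3.jar \
#   kafka-clients-3.2.0.jar
```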
V. Apache Kafka
The sequential steps for installing Apache Kafka are as follows:
1. Go to the home directory
cd
2. Download the Apache Kafka binary archive from its official website
wget https://downloads.apache.org/kafka/3.5.0/kafka_2.12-3.5.0.tgz
3. Extract the archive and move it to /opt (root privileges are required)
tar xzf kafka_2.12-3.5.0.tgz
sudo mv kafka_2.12-3.5.0 /opt/kafka
4. Create the systemd unit file for the Zookeeper service (root privileges are required)
sudo nano /etc/systemd/system/zookeeper.service
[Unit]
Description=Apache Zookeeper service
Documentation=http://zookeeper.apache.org
Requires=network.target remote-fs.target
After=network.target remote-fs.target
[Service]
Type=simple
ExecStart=/opt/kafka/bin/zookeeper-server-start.sh /opt/kafka/config/zookeeper.properties
ExecStop=/opt/kafka/bin/zookeeper-server-stop.sh
Restart=on-abnormal
[Install]
WantedBy=multi-user.target
5. Reload the systemd daemon so the new unit takes effect
sudo systemctl daemon-reload
6. Create the systemd unit file for the Kafka service
sudo nano /etc/systemd/system/kafka.service
[Unit]
Description=Apache Kafka Service
Documentation=http://kafka.apache.org/documentation.html
Requires=zookeeper.service
[Service]
Type=simple
Environment="JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64"
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
ExecStop=/opt/kafka/bin/kafka-server-stop.sh
[Install]
WantedBy=multi-user.target
7. Reload the daemon again
sudo systemctl daemon-reload
8. Start the Zookeeper service
sudo systemctl start zookeeper
9. Ensure that the Zookeeper service is running
systemctl status zookeeper
10. Start the Kafka service
sudo systemctl start kafka
11. Ensure that the Apache Kafka service is running
systemctl status kafka
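Once both services report active, a quick smoke test is to create a topic and list it back. The function below is a sketch of ours that assumes the paths and broker address used in this guide (/opt/kafka and localhost:9092; the topic name smoke-test is arbitrary), and it skips gracefully on machines where Kafka is not installed:

```shell
#!/bin/sh
# kafka_smoke_test: create a test topic and list topics on a local broker.
# Skips with a message when the Kafka CLI tools are not found.
kafka_smoke_test() {
  bin=${1:-/opt/kafka/bin}
  broker=${2:-localhost:9092}
  if [ -x "$bin/kafka-topics.sh" ]; then
    "$bin/kafka-topics.sh" --create --if-not-exists --topic smoke-test \
      --bootstrap-server "$broker" --partitions 1 --replication-factor 1
    "$bin/kafka-topics.sh" --list --bootstrap-server "$broker"
  else
    echo "kafka-topics.sh not found under $bin; skipping"
  fi
}

# Usage: kafka_smoke_test /opt/kafka/bin localhost:9092
```

If the topic shows up in the list output, the broker and its Zookeeper dependency are both healthy.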
VI. Apache Zeppelin
The sequential steps for installing Apache Zeppelin are as follows:
1. Go to the home directory
cd
2. Download the Apache Zeppelin binary archive from its official website
wget https://dlcdn.apache.org/zeppelin/zeppelin-0.10.1/zeppelin-0.10.1-bin-all.tgz
3. Extract the Zeppelin archive and rename the extracted directory
tar xzf zeppelin-0.10.1-bin-all.tgz
mv zeppelin-0.10.1-bin-all/ zeppelin
4. Modify the Apache Zeppelin configuration file (zeppelin-site.xml) to enable access to its interface. Note that the Zeppelin port is changed to 8082 here, as port 8080 may already be in use by Jupyter Notebook.
cd zeppelin/conf
cp zeppelin-site.xml.template zeppelin-site.xml
nano zeppelin-site.xml
Modify the following properties so that Zeppelin listens on all interfaces on port 8082:
<property>
  <name>zeppelin.server.addr</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>zeppelin.server.port</name>
  <value>8082</value>
</property>
5. Run Apache Zeppelin
cd ..
cd bin
./zeppelin-daemon.sh start
6. Access Apache Zeppelin Interface in your web browser at (localhost:8082)
7. Add the (FLINK_HOME) variable to the Apache Zeppelin Flink interpreter settings so that Zeppelin can locate your Flink installation. You can obtain its value using the following command
echo $FLINK_HOME
8. Restart Apache Zeppelin
cd
cd zeppelin/bin
./zeppelin-daemon.sh restart
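If the interface does not load after the restart, a command-line check helps distinguish a down service from a browser or network issue. The helper below is our own sketch built on curl; it only assumes the port chosen above (8082):

```shell
#!/bin/sh
# check_http: print "up: <url>" if the URL answers with a successful HTTP
# status, "down: <url>" otherwise; reports when curl is unavailable.
check_http() {
  url=$1
  if command -v curl >/dev/null 2>&1; then
    if curl -fsS -o /dev/null --max-time 5 "$url" 2>/dev/null; then
      echo "up: $url"
    else
      echo "down: $url"
    fi
  else
    echo "curl not available; cannot check $url"
  fi
}

# Usage: check_http http://localhost:8082
```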
It is important to note that Apache Flink must not already be running in the background, because Apache Zeppelin starts it automatically; otherwise, you will encounter a "cannot open Flink interpreter" error. Therefore, stop Apache Flink before using it from Apache Zeppelin. You can stop it using the following commands:
1. Go to the Apache Flink bin directory
cd
cd flink14/bin
2. Stop Apache Flink service
./stop-cluster.sh
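The stop step above can be wrapped in a small guard that is safe to run whether or not a cluster is up. This is a convenience sketch of ours, assuming FLINK_HOME is set as in the Flink section:

```shell
#!/bin/sh
# stop_flink_if_present: run Flink's stop-cluster.sh when it exists, otherwise
# explain why nothing was stopped. Safe to call before launching Zeppelin.
stop_flink_if_present() {
  home=${1:-${FLINK_HOME:-}}
  if [ -n "$home" ] && [ -x "$home/bin/stop-cluster.sh" ]; then
    "$home/bin/stop-cluster.sh"
  else
    echo "no Flink installation found at '${home:-<unset>}'; nothing to stop"
  fi
}

# Usage: stop_flink_if_present "$FLINK_HOME"
```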
CONCLUSION
In conclusion, we have completed the installation of the crucial components for building a robust data processing environment on Ubuntu 22.04: Java, Scala, sbt, Apache Flink, Apache Kafka, and Apache Zeppelin. Connectivity between these services has also been configured. By installing them, we have laid the foundation of a real-time data processing system that can serve a variety of purposes. However, our journey does not end here. In the second part of this article, we will dive into a real data processing case that puts these services to practical use, giving us the opportunity to see how these tools work together in a real-world scenario and to gain valuable insights and skills.
REFERENCES
Apache Flink. (n.d.). What is Apache Flink? — Architecture. Accessed 04.09.2023. Retrieved from https://flink.apache.org/what-is-flink/flink-architecture/
Java. (n.d.). What is Java technology and why do I need it? Accessed 04.09.2023. Retrieved from https://www.java.com/en/download/help/whatis_java.html
Scala. (n.d.). Tour of Scala. Accessed 04.09.2023. Retrieved from https://docs.scala-lang.org/tour/tour-of-scala.html
Kafka. (n.d.). Apache Kafka. Accessed 04.09.2023. Retrieved from https://kafka.apache.org/
sbt. (n.d.). The interactive build tool. Accessed 04.09.2023. Retrieved from https://www.scala-sbt.org/
Zeppelin. (n.d.). Apache Zeppelin. Accessed 04.09.2023. Retrieved from https://zeppelin.apache.org/