The web content provides a comprehensive tutorial on setting up a Change Data Capture (CDC) pipeline using Kafka, Kafka Connect, Debezium Connector, and MongoDB with Docker-compose.
Abstract
The article offers a step-by-step guide for implementing a CDC pipeline to synchronize and analyze real-time data changes from a MongoDB database using Debezium with Kafka and Kafka Connect. It outlines the prerequisites, explains the role of Debezium in CDC, and details the setup process for Kafka, Kafka Connect, and a MongoDB replica set using Docker-compose. The tutorial also covers the creation of a Debezium MongoDB Connector, including configuration and verification of real-time data capture through topic creation and event tracking in Kafka. The article concludes with instructions for cleanup and provides references to further documentation and related articles, encouraging readers to monitor their Debezium connectors and engage with the author's Medium content.
Opinions
The author emphasizes the importance of CDC in modern digital infrastructures for real-time data synchronization and analysis.
Debezium is highlighted as a key open-source tool for capturing and streaming database changes to systems like Apache Kafka.
The author suggests that using Debezium with MongoDB requires a MongoDB replica set or sharded cluster, as standalone servers lack the necessary oplog for change monitoring.
The article promotes the use of Docker-compose for orchestrating the necessary services, including Kafka, Kafka Connect, and MongoDB instances.
Akhq, a Kafka GUI, is recommended for managing Kafka topics, consumer groups, and connectors, indicating a preference for user-friendly interfaces in managing Kafka ecosystems.
The author provides a personal GitHub repository with code examples, indicating a commitment to sharing practical knowledge and resources with the community.
Monitoring the Debezium connector is presented as a critical aspect of maintaining a CDC pipeline, with a suggestion to use Prometheus and Grafana for this purpose.
The author invites reader interaction by asking for feedback and sharing in the comments, fostering a sense of community and ongoing learning.
A promotional note encourages readers to try an AI service recommended by the author, positioned as a cost-effective alternative to ChatGPT Plus (GPT-4).
Hands on: CDC Debezium + Kafka Connect + MongoDB
A step-by-step tutorial for setting up Change Data Capture with Debezium on a MongoDB and Kafka only with Docker-compose
In the modern digital infrastructures, Change Data Capture (CDC) emerges as a crucial mechanism, enabling real-time data synchronization and analysis across diverse ecosystems.
In this article we are gonna see step by step how to implement a change Data Capture pipeline using Kafka, Kafka Connect, Debezium Connector and MongoDB.
Prerequisites
Docker, Docker Compose
For windows users, you can replace “localhost” by “host.docker.internal” to connect to services running on the Windows-host from inside a Docker-container.
CDC using Debezium ?
Change Data Capture (CDC) using Debezium refers to a technique that captures and monitors changes made to a database in real-time.
Debezium is an open-source distributed platform that provides a framework for capturing row-level changes from various databases and streaming them to other systems, such as Apache Kafka.
By utilizing Debezium, organizations can implement CDC to track and propagate changes, enabling real-time data synchronization, analysis, and integration across disparate systems.
Debezium MongoDB Connector
Debezium is built on top of Kafka and provides Kafka Connect compatible connectors that monitor specific database management systems. Debezium MongoDB Connector is used to capture and stream changes from MongoDB databases in real-time.
PS:
MongoDB Connector supports only a MongoDB replica set or a MongoDB sharded cluster. It is not capable of monitoring the changes of a standalone MongoDB server, since standalone servers do not have an oplog ( see this story about oplog ). The connector will work if the standalone server is converted to a replica set with one member.
Setup Kafka + Kafka Connect + MongoDB
With the following docker-compose:
We create the following services:
Akhq: Kafka GUI for Apache Kafka to manage topics, topics data, consumers group, schema registry, connect and more…
Zookeeper: we install zookeeper to use it with kafka.
Kafka: is an open-source distributed event streaming platform designed to handle real-time data feeds and processing pipelines.
Kafka connect: is a tool for streaming data between Apache Kafka and other data systems. It makes it simple to quickly define connectors that move large data sets in and out of Kafka. We use it to add the debezium MongoDB connector.
Three MongoDB servers used as a replicat set with 3 members: in order to use Debezium with MongoDb, we need a replicat set. Here we are creating a replica set with 3 instances. We are also going to expose each of them to our local machine, so that we can access them using the Mongo shell interface from our local machine. Each of the three Mongo containers should be able to communicate with all other containers in the network. finally, we define the initialization server that runs the `rs.initiate` command to intialize the replica set and connect the three servers to each other.
We download the connector jar and add it to Kafka Connect’s plugin path (see above in the docker-compose, and code GitHub repository).
Using the connector
To add the connetor, we are using the interface Akhq:
Create the debezium mongodb connector
We are gonna specify the following configs:
name ( The name of the connector and should be unique)= “test”
topic.prefix ( It identifies and provides a namespace for the particular database server/cluster is capturing changes. The topic prefix should be unique) = “entreprise”
collection.include.list ( A list of regular expressions that match the collection names for which changes are to be captured ) = “test.entreprise”
database.include.list ( A list of regular expressions that match the database names for which changes are to be captured ) = “test”