avatarRitam Mukherjee

Summary

The website content outlines a method for real-time fraud detection in financial transactions using Apache Kafka for data ingestion and Spark Streaming for processing and analysis.

Abstract

The article presents a comprehensive guide on constructing a real-time fraud detection system for financial transactions by leveraging the capabilities of Apache Kafka and Spark Streaming. It begins by emphasizing the importance of such a system due to the high volume of transactions and the need for immediate fraud identification. The setup involves Kafka as a data source for streaming transaction data, Spark Streaming for applying fraud detection rules in Scala, and an alerting system to record flagged transactions. The author provides code snippets for each step, including setting up Kafka, simulating transaction data, and defining the Spark Streaming pipeline with fraud detection logic. The system is designed to flag transactions that are unusually large, rapidly occurring, or high-risk international transactions. The article concludes with instructions for testing the pipeline, scaling it for production environments, and suggestions for further enhancements, such as integrating machine learning models.

Opinions

  • The author expresses enthusiasm about the effectiveness of combining Kafka and Spark Streaming for fraud detection, suggesting it is both straightforward and scalable.
  • There is an emphasis on the practicality of the provided example, with encouragement for readers to try implementing the system in their own contexts.
  • The author believes that real-time analytics can significantly improve the response to fraud in the financial industry, which is crucial for both fintech startups and established institutions.
  • The article suggests that the example given is just a starting point and can be extended with more sophisticated rules and machine learning models for improved accuracy.
  • The author values reader engagement and feedback, as indicated by the invitation for claps and follows on Medium and LinkedIn.

Real-Time Use-case : Fraud Detection in Financial Transactions with Kafka and Spark Streaming

Leveraging Kafka and Spark Streaming for Real-Time Fraud Detection

Non-members can access the full article through this Link.

Photo by Bermix Studio on Unsplash

Introduction

Let’s face it — fraud is a big problem for financial institutions. With a staggering number of transactions happening every second, spotting fraudulent activity before it impacts customers and the business is crucial. That’s where real-time analytics comes into picture, and we’re diving into how-to-build a streamlined fraud detection system using Apache Kafka and Spark Streaming.

Imagine monitoring transaction data in real time, identifying suspicious activity, and flagging potentially fraudulent transactions instantly. In this article, we’ll use Kafka to stream transaction data and Spark Streaming in Scala to detect fraud patterns as they occur.

Excited ?? Let’s jump right in!

Why Kafka and Spark?

Kafka is an amazing tool for ingesting real-time data and handling millions of messages per second. Spark Streaming, on the other hand, is excellent for fast, scalable data — processing. Together, they’re a perfect duo for building real-time fraud detection systems that require low latency, high throughput, and quick insights.

The Setup

Let’s start with a high-level picture of what we’ll build:

  1. Kafka will act as the data source, continuously streaming financial transactions.
  2. Spark Streaming will process the data, applying rules to detect suspicious activities.
  3. Alerting System (like an external API or database) where flagged transactions will be recorded.

Use Case: Detecting Suspicious Transactions

Our use case is straightforward: monitor transactions for patterns that suggest fraud. For example:

  • Large withdrawals
  • Rapid transactions from the same account within specified duration
  • High-risk international transactions

Now, let’s get into the code!!

Step 1: Set Up Kafka

Assuming Kafka is running locally, we’ll start by creating a topic named transactions. This is where we’ll publish our financial transaction data.

kafka-topics.sh --create --topic transactions --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1

Step 2: Simulate Transaction Data

We’ll need a data producer to generate mock transaction data. In a real — world scenario, this data would come from your payment processing system; but here we can use a simple Scala producer to push data into Kafka:

import org.apache.kafka.clients.producer._
import java.util.Properties
import scala.util.Random

//Properties for KafkaProducer
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("acks", "all"); // Most durable setting, producer waits for full commit of the KafkaRecord from the broker 
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

//Code for transaction stream simulation
while (true) {
  val transaction = s"""{
    "account_id": "${Random.nextInt(1000)}",
    "amount": ${Random.nextDouble() * 20000},
    "location": "${if (Random.nextBoolean()) "domestic" else "international"}",
    "timestamp": ${System.currentTimeMillis()}
  }"""

//Sending ProducerRecords to the prodcuer for topic - transactions
  producer.send(new ProducerRecord[String, String]("transactions", null, transaction))
  Thread.sleep(100)
}

Step 3: Spark Streaming for Fraud Detection

Let’s get into Spark Streaming. The code below reads from the transactions Kafka topic, deserializes the data, and applies simple fraud detection rules(explain later in this article).

Step 3.1: Set Up Spark Session

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder
  .appName("FraudDetection")
  .master("local[*]")
  .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

Step 3.2: Define the Schema

We’ll define a schema for our transaction data, which will make it easier to parse JSON into DataFrame rows.

val transactionSchema = new StructType()
  .add("account_id", StringType)
  .add("amount", DoubleType)
  .add("location", StringType)
  .add("timestamp", LongType)

Step 3.3: Stream Data from Kafka

Here, we connect to Kafka and read transaction data, parsing it as JSON.

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "transactions")
  .load()

val transactions = kafkaDf
  .selectExpr("CAST(value AS STRING) as json")
  .select(from_json(col("json"), transactionSchema).as("data"))
  .select("data.*")

“Enjoying the insights so far? If so, please tap the clap 👏 icon! It helps me know you’re finding value here. And if you’d like to stay updated with more content like this, follow me on Medium and LinkedIn

Step 3.4: Apply Fraud Detection Rules

Now, we’ll add simple fraud detection rules. In this case, let’s flag transactions that:

  1. Exceed $10,000.
  2. Occur too rapidly (within 1 minute of another transaction from the same account).
  3. Are international and exceed $5,000.
val flaggedTransactions = transactions
  .withColumn("is_large_withdrawal", col("amount") > 10000)
  .withColumn("is_high_risk_location", col("location") === "international" && col("amount") > 5000)
  
// Rule for rapid transactions from the same account within 1 minute
val rapidTransactions = transactions
  .groupBy(window(col("timestamp"), "1 minute"), col("account_id"))
  .count()
  .filter("count > 1")
  .select( col("account_id"), 
    col("window.start").as("window_start"), 
    col("window.end").as("window_end"),
    col("count") 
   )


// Get final suspicious transactions
val suspiciousTransactions = flaggedTransactions.alias("flagged")  
  .join( rapidTransactions.alias("rapid"),
    col("flagged.account_id") === col("rapid.account_id") &&
    col("flagged.timestamp").between(col("rapid.window_start"), col("rapid.window_end")), 
    "left_outer" // left_outer join ensures other transactions from flaggedTransactions are retained 
  )
  .filter(col("is_large_withdrawal") || col("is_high_risk_location") || col("count").isNotNull) // count present means its flagged in rapidTransactions
  .select(
    col("flagged.account_id"),
    col("flagged.amount"),
    col("flagged.location"),
    col("flagged.timestamp")
  )

Step 3.5: Write Output

You could output suspicious transactions to the console, a database, or trigger an alert.

a. Viewing the write stream in console —

val query = suspiciousTransactions.writeStream
  .outputMode("append")
  .format("console")
  .start()

query.awaitTermination()

b.Sending to DataBase for future analysis —

val query = suspiciousTransactions.writeStream
  .foreachBatch { (batchDF, batchId) =>
    batchDF.write
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/fraud_db")
      .option("dbtable", "suspicious_transactions")
      .option("user", "username")
      .option("password", "password")
      .mode("append")
      .save()
  }
  .start()

query.awaitTermination()

c. Writing the rows to an Alerting System —

val query = suspiciousTransactions.writeStream
  .foreachBatch { (batchDF, batchId) =>
    batchDF.collect().foreach { row =>
      AlertingSystem.sendAlert(row) // Replace this with your alerting logic
    }
  }
  .start()

query.awaitTermination()

Step 4: Testing the Pipeline

To test the pipeline:

  1. Start the Kafka producer script to publish transactions.
  2. Run the Spark Streaming job to consume data and flag suspicious transactions.

Watch the console output to see if transactions are flagged. If you see outputs like:

+----------+-------+-----------+-------------+
|account_id|amount |location   |timestamp    |
+----------+-------+-----------+-------------+
|102       |11000.0|domestic   |1679926400000|
|540       |7000.0 |international|1679926460000|
+----------+-------+-----------+-------------+

These transactions meet one or more fraud criteria.

Scaling It Up

We need to keep in mind that these example works well on a small scale, but here are some tips for scaling up in a production environment:

  • Kafka Partitioning: Increase partitions on the transactions topic to support parallelism.
  • Increase Topic Replication Factor to make the system more reliable.
  • Checkpoints and State Management: Use Spark’s checkpointing to maintain state across batches.
  • Alerting: Integrate with notification services for immediate fraud alerts.
  • Model-Based Detection: Consider applying machine learninng models in Spark to detect fraud more accurately.

Wrapping Up!

With Kafka and Spark Streaming, building a real-time fraud detection pipeline is straightforward, scalable, and incredibly effective. This example scratches the surface — you can extend it with more complex rules, machine learning models, or integrations with alerting systems.

Whether you’re a fintech startup or a large financial institution, real-time analytics can transform how you respond to fraud.

So, give it a try, and happy coding!

Thank you for taking the time to read my article! If you found this useful, your claps 👏 would motivate me to keep on writing such valuable content. You can also follow me on Medium and LinkedIn to stay connected and catch all my latest insights.

Further Reading

You may also like some of my below articles —

#ApacheSpark #DataPipeline #DataEngineering #Kafka #Streaming #FraudDetection

Spark
Kafka
Data Engineering
Machine Learning
Data
Recommended from ReadMedium