Yalin Yener

Summary

The provided content is a comprehensive guide on how to build a simple data flow using Apache Nifi, including an explanation of its key concepts, a tutorial for creating a data flow that processes JSON data, and instructions for downloading and starting Nifi.

Abstract

The article titled "How to Build a Simple Data Flow with Apache Nifi" offers an overview of Apache Nifi as a robust tool for managing data ingestion and flow. It outlines the three essential components of data flow: the data itself, the infrastructure it traverses, and the people designing the flow. The author emphasizes Nifi's user-friendly web interface, which allows users to easily collect, transform, and manage data in real-time. The guide includes step-by-step instructions for downloading and starting Nifi on various operating systems, as well as an explanation of key Nifi concepts such as Processors, Process Groups, Templates, and Connections. A sample tutorial is provided, demonstrating how to create a data flow that fetches JSON data via HTTP, splits the data, searches for specific keywords, and outputs the results into separate files based on the content found. The article concludes with the author encouraging feedback and offering further engagement through a subscription to DDI Intel.

Opinions

  • The author believes that data flow complexity varies depending on the data, infrastructure, and the people involved in the design process.
  • Apache Nifi is presented as a powerful and flexible tool for data flow management, suitable for both beginners and advanced users due to its drag-and-drop interface and extensive library of processors.
  • The author suggests that organizing processors into Process Groups and using Templates can enhance the usability and reusability of Nifi flows.
  • The tutorial provided reflects the author's opinion that practical examples are an effective way to learn, as it guides readers through a real-world scenario of processing and categorizing JSON data.
  • The author's use of screenshots and detailed instructions indicates a commitment to making the learning process as clear and accessible as possible for the reader.
  • By inviting readers to contact them and subscribe to DDI Intel, the author expresses an openness to community engagement and a desire to provide ongoing expert insights in the field of data science.

How to Build a Simple Data Flow with Apache Nifi

In this post, I am going to give a brief introduction to Apache Nifi, one of the most efficient tools for data flow, and build a simple design as a quick tutorial.

Data Flow and Apache Nifi

Moving any content from A to B is defined as data flow. Sometimes a data flow can be executed easily, but sometimes it can be complex and difficult. In my opinion, the data flow process depends on 3 components.

  • Data
  • Infrastructure
  • People

Data: Data is defined as the content that will be moved. This content could be log files, XML, CSV, images, video or any other type, and each may have different standards, formats, protocols, schemas etc. So collecting data from a variety of sources and transferring it to a target destination can be complicated.

Infrastructure: It can be defined as the platform that includes the source or target for the data. With this component you need to deal with security, network etc.

People: The people who will design the data flow.

Apache Nifi provides us with a powerful and flexible management tool for data ingestion and data flow. Nifi runs in your browser (web based) and has a simple drag and drop user interface. It provides a platform that can collect and transform data in real time. You can use Apache Nifi on a Windows or Linux/Mac OS computer.

Download and Start Nifi

First you need to download Nifi by using this official link. You will see 2 options on the download page. The gz file is for Linux/Mac OS users and the zip file is for Windows users.

If you are using Mac OS and have Homebrew (a software package management system), you can use the brew install nifi command in the terminal to download and install Apache Nifi.

After you download and install Nifi, you need to check the service status and may need to start the service. Mac OS users can check the service status by typing the nifi status command in the terminal.

Depending on your Nifi service status, you can use commands such as nifi start, nifi stop and nifi restart to control the service.

After you start the Nifi service, open your browser and go to http://localhost:8080/nifi/ to access the Nifi web based interface. (Nifi 1.14 and later default to https://localhost:8443/nifi instead.)

Apache Nifi Key Concepts

Processor: Processors are the basic building blocks for creating a data flow. Each processor has a different functionality. Apache Nifi has 280+ processors in the default installation, and you can also write your own processor.

Process Group: Users can group their processors based on projects or organizations.

Template: You can save your Nifi flows by using templates, so that the flows can be used by other users. You can create, download, upload and add templates as XML files. A template includes all of your processors along with their process groups.

Connection: Links between processors. Each connection may have some relationship rules.

Sample Tutorial

Scenario: In this tutorial, I am going to design a simple data flow that takes JSON data, splits it, finds special keywords and puts the results into my local folder.

I am going to invoke the link below over HTTP. https://geoserver.nottinghamcity.gov.uk/opendata/geojson/ncc_Recycling_Centres.json

InvokeHTTP

An HTTP client processor which can interact with a configurable HTTP endpoint.

  1. Drag the Processor icon and drop it on the white area
  2. Select InvokeHTTP and click ADD
  3. Right click your InvokeHTTP processor and select Configure
  4. In the processor configuration, enter the link above in the Remote URL field
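To make clear what this step does, here is a minimal Python sketch of the same idea outside of Nifi: a plain GET request that returns the response body as text. The helper name fetch_text and the 30 second timeout are assumptions for illustration only, and this is not how InvokeHTTP is implemented internally.

```python
import urllib.request

# URL taken from the tutorial above; everything else is illustrative.
REMOTE_URL = (
    "https://geoserver.nottinghamcity.gov.uk/opendata/geojson/"
    "ncc_Recycling_Centres.json"
)


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Issue a GET request and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Example call (needs network access, so it is left commented out):
# body = fetch_text(REMOTE_URL)
# print(body[:200])
```

In the Nifi flow, the response body becomes the content of a FlowFile that is passed on to the next processor.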

SplitText

The processor splits a text file into multiple smaller text files on line boundaries limited by maximum number of lines or total size of fragment.

In the SplitText configuration, I set the processor to split my sample JSON file one line at a time.
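The splitting behaviour can be sketched in a few lines of Python. This is a simplified illustration, not Nifi's implementation; the function name split_text and its line_split_count parameter are assumptions that mirror the idea of limiting each fragment to a maximum number of lines.

```python
from typing import List


def split_text(content: str, line_split_count: int = 1) -> List[str]:
    """Split text into fragments of at most line_split_count lines each."""
    if line_split_count < 1:
        raise ValueError("line_split_count must be at least 1")
    # keepends=True preserves the original line terminators in each fragment.
    lines = content.splitlines(keepends=True)
    return [
        "".join(lines[start:start + line_split_count])
        for start in range(0, len(lines), line_split_count)
    ]
```

With line_split_count set to 1, every line of the input becomes its own fragment, which is the setting used in this tutorial.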

RouteOnContent

The processor applies regular expressions to the content of a FlowFile and routes a copy of the FlowFile to each destination whose regular expression matches. I created 2 new properties using regular expressions. I would like to find “Car Park” and “School” contents.
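As a rough sketch of this routing step, the Python below sends each fragment to every route whose regular expression is found in it, and to an "unmatched" bucket otherwise. The route names car_park and school and the exact patterns are assumptions, since the post does not show the expressions that were entered in Nifi.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical route names and patterns; the original expressions are not shown.
ROUTES: Dict[str, Pattern[str]] = {
    "car_park": re.compile(r"Car Park"),
    "school": re.compile(r"School"),
}


def route_on_content(
    fragments: List[str], routes: Dict[str, Pattern[str]] = ROUTES
) -> Dict[str, List[str]]:
    """Copy each fragment to every route whose pattern occurs in it."""
    routed: Dict[str, List[str]] = {name: [] for name in routes}
    routed["unmatched"] = []
    for fragment in fragments:
        matched = False
        for name, pattern in routes.items():
            if pattern.search(fragment):
                routed[name].append(fragment)
                matched = True
        if not matched:
            routed["unmatched"].append(fragment)
    return routed
```

Note that a fragment containing both keywords ends up in both routes, which matches the "copy to each matching destination" behaviour described above.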

MergeContent

The processor merges a group of FlowFiles together based on a user defined strategy and packages them into a single FlowFile. I created 2 MergeContent processors: one merges the “Car Park” content into a single file, the other merges the “School” content into a single file.
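The merge step can be pictured as joining the routed fragments back together. The sketch below is only an illustration under assumptions: the function name merge_content, the newline demarcator and the max_entries limit of 1000 are chosen for the example and are not taken from the configuration used in this post.

```python
from typing import List


def merge_content(
    fragments: List[str], demarcator: str = "\n", max_entries: int = 1000
) -> List[str]:
    """Join fragments into bundles of at most max_entries fragments each."""
    if max_entries < 1:
        raise ValueError("max_entries must be at least 1")
    bundles = []
    for start in range(0, len(fragments), max_entries):
        # Each bundle plays the role of one merged FlowFile.
        bundles.append(demarcator.join(fragments[start:start + max_entries]))
    return bundles
```

In this tutorial each keyword has few enough matches that everything for one keyword fits into a single bundle.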

PutFile

The processor writes the contents of a FlowFile to the local file system. You should change the output directory to your own. For my data flow, I put the results into the 2 new locations shown below:

/tmp/recyle_data/school

/tmp/recyle_data/car_park
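For comparison, writing merged content to one of these locations could look like the following Python sketch. The helper name put_file and the file name merged.json in the usage comment are hypothetical, and unlike a real PutFile processor this sketch simply overwrites an existing file.

```python
from pathlib import Path
from typing import Union


def put_file(content: str, directory: Union[str, Path], filename: str) -> Path:
    """Write content to directory/filename, creating the directory if needed."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    # Assumption: overwrite silently if the file already exists.
    target.write_text(content, encoding="utf-8")
    return target


# Example call with a hypothetical file name:
# put_file(merged_school_content, "/tmp/recyle_data/school", "merged.json")
```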

After you have designed all the processors, you need to connect them with some rules.

You can find the data flow design above as a template in my Github repo.

Finally, let’s start the data flow and see what happens. You can start all processors at the same time or one by one. After you start your data flow, you can trace the processors by their input, read/write, output and tasks/time statistics.

Let’s check the target output files. You can check your files in the terminal or in the Nifi interface.

Terminal

You can use the cat, head or tail commands to see the content of your data flow’s files.
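If you prefer to inspect the output from a script instead of the shell, a small Python equivalent of head could look like this. The function name head and the default of 10 lines are assumptions made for the example.

```python
from itertools import islice
from pathlib import Path
from typing import List, Union


def head(path: Union[str, Path], n: int = 10) -> List[str]:
    """Return the first n lines of a text file without trailing newlines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Example: show the first lines of every file written by the flow.
# for output_file in sorted(Path("/tmp/recyle_data/school").iterdir()):
#     print(output_file.name, head(output_file, 5))
```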

Nifi Interface

Select the PutFile processor, right click, then select View Data Provenance.

Select a row, click View Details by pressing the “i” icon, and open the “Content” tab.

You can download the final JSON file or view its content.

In brief:

  • I designed a data flow process
  • I invoked an HTTP URL that returns JSON content (recycling centres)
  • I split this JSON line by line
  • I created two properties to find keywords (Car Park, School)
  • Nifi found each keyword and merged the matches
  • And created 2 individual files based on my keyword rules.

Thanks for reading my post and I hope you like it. Feel free to contact me if you have any questions or if you’d like to share your comments.

