Ramesh Nelluri - Ideas to Life

Summary

The article provides a comprehensive guide on how to call a REST API from Databricks, process the JSON response, and store the data in delta tables within Databricks.

Abstract

The article in question delves into the technical process of interfacing with a REST API using Databricks, a cloud-based data analytics platform. It begins by introducing the concept of REST APIs and their role in modern web interactions, emphasizing their popularity and utility in software development. The author then illustrates how to utilize Databricks to process JSON data obtained from a REST service, specifically demonstrating this with the open-source project postcodes.io, which provides UK postcode data. The step-by-step guide includes establishing a connection to the API, setting up the necessary parameters for the REST call, making the GET request, and handling the response by converting it into a JSON format. Subsequently, the article explains how to transform the JSON data into a DataFrame, select relevant columns, and ultimately store the processed data in a delta table within Databricks. The article concludes by providing a visual representation of the results and encouraging readers to explore further topics in data engineering through a series of learning articles.

Opinions

  • The author positions Databricks as a powerful and user-friendly platform for handling large volumes of data from diverse sources, including REST APIs.
  • There is an emphasis on the importance of understanding how to work with JSON data structures, which are commonly returned by REST APIs.
  • The author suggests that converting JSON data into a DataFrame is a critical step that is often done incorrectly, highlighting a common pitfall in data processing.
  • By showcasing the use of postcodes.io, the author implies that leveraging open-source projects can be beneficial for developers seeking to integrate various data sets into their analytics workflows.

How To Call REST API & Store Data in Databricks

This article covers the basics of REST APIs and how to call one from Databricks. We will also learn how to process the JSON structures returned by a REST service and store the data in Databricks (Delta tables).


Databricks

Databricks is a popular cloud-based computing platform for data science and analytics. It simplifies processing large volumes of data from files, streams, databases, and REST services.

One of the main reasons for its success is that it makes it easy to find and work with large amounts of data stored in disparate formats, databases, and other storage systems.

What is a REST service

A REST API is an application programming interface that lets computer systems interact over HTTP. Requests and responses are typically encoded in JSON or XML, and the API itself is often described in a format such as JSON or YAML.

REST APIs are widely used by programmers to build software applications. The architecture of a RESTful system is based on resources, which can include, but are not limited to, data records in a database. Each resource is identified by a URL and is read or modified with standard HTTP methods such as GET, POST, PUT, and DELETE.
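
To make the idea of resources concrete, here is a minimal sketch of how a request pairs an HTTP method with a resource URL. The helper resource_url, the host api.example.com, and the "customers" resource are placeholders chosen for illustration; they do not refer to a real service.

```python
from urllib.parse import quote, urlunsplit


def resource_url(host, *segments):
    """Build the HTTPS URL that identifies a resource (hypothetical helper)."""
    # Percent-encode each path segment so special characters stay inside it.
    path = "/" + "/".join(quote(segment, safe="") for segment in segments)
    return urlunsplit(("https", host, path, "", ""))


# Each entry pairs an HTTP method with the URL of the resource it acts on.
requests_to_send = [
    ("GET", resource_url("api.example.com", "customers")),           # read the collection
    ("GET", resource_url("api.example.com", "customers", "42")),     # read one record
    ("DELETE", resource_url("api.example.com", "customers", "42")),  # remove that record
]

for method, url in requests_to_send:
    print(method, url)
```

The same pattern applies to the GET call used later in this article: one method, one URL that names the resource.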

Credits to postcodes.io

Postcodes.io is an open-source project maintained by Ideal Postcodes. It is a free resource that allows developers to search, reverse geocode, and extract UK postcodes and associated data.

In this example, I use its /random/postcodes endpoint to retrieve a random postcode.

Code snippet explanation

This code is executed in a Databricks notebook with Python.

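Import required Python packages

First we import the two required Python modules: http.client (for making the HTTP request) and json (for processing the JSON received from the REST call).

import http.client  # "import http" alone does not guarantee the client submodule is loaded
import json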

Establish connection to postcodes.io

conn = http.client.HTTPSConnection("api.postcodes.io")

Setup input parameters to REST call

  1. payload (the call retrieves a random postcode, so the payload is empty)
  2. headers (provides the cookie required to make the connection to postcodes.io)

payload = ''
headers = {
    'Cookie': '__cfduid=d2e270bea97599e2fbde210bf483fcd491615195032'
}

Make the call with API endpoint

Here we make a single GET call for a random postcode.

conn.request("GET", "/random/postcodes", payload, headers)

Receive response and store as JSON

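In this step, read the response body, decode it from UTF-8 bytes into a string, and keep it as JSON text for Spark to parse.

res = conn.getresponse()
data = res.read().decode("utf-8")  # response body as a Python string
# Parse to validate the JSON, then serialize it back to a single-line string for Spark
jsondata = json.dumps(json.loads(data))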
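
The steps so far do not check whether the request actually succeeded. As a minimal sketch, the request, the status check, and the parsing could be wrapped in one helper. The name fetch_json and its error handling are assumptions for illustration and are not part of the original notebook.

```python
import http.client
import json


def fetch_json(conn, path, headers=None):
    """Send a GET request on an open connection and return the parsed JSON body.

    Hypothetical helper: conn is expected to behave like
    http.client.HTTPSConnection (request() and getresponse()).
    """
    conn.request("GET", path, headers=headers or {})
    res = conn.getresponse()
    body = res.read().decode("utf-8")
    if res.status != 200:
        # Fail early instead of passing an error payload downstream.
        raise RuntimeError(f"GET {path} failed with HTTP {res.status}: {body[:200]}")
    return json.loads(body)


def fetch_random_postcode():
    """Call the endpoint used in this article (needs network access)."""
    conn = http.client.HTTPSConnection("api.postcodes.io")
    try:
        return fetch_json(conn, "/random/postcodes")
    finally:
        conn.close()
```

Calling fetch_random_postcode() returns a Python dict. To hand it to Spark as in the next step, serialize it back to text with json.dumps.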

Convert the JSON data into a dataframe

This is where many go wrong: spark.read.json expects a file path or an RDD of JSON strings, so the JSON string must first be parallelized into an RDD.

df = spark.read.json(sc.parallelize([jsondata]))

Select only required columns

The previous step yields a DataFrame with two columns, status and result. The result column is itself a nested structure, so extract the values from it as follows.

df_temp = df.selectExpr(
    "string(status) as status",
    "result['country'] as country",
    "result['european_electoral_region'] as european_electoral_region",
    "string(result['latitude']) as latitude",
    "string(result['longitude']) as longitude",
    "result['parliamentary_constituency'] as parliamentary_constituency",
    "result['region'] as region",
    "'' as vld_status",
    "'' as vld_status_reason",
)
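
To see what this projection does without a Spark cluster, the same flattening can be sketched in plain Python. The sample payload below only mimics the shape described above (status plus a nested result); its values are made up for illustration, and the helper flatten is hypothetical.

```python
import json

# Illustrative payload in the shape described above; the values are made up.
sample = json.loads("""
{
  "status": 200,
  "result": {
    "country": "England",
    "european_electoral_region": "London",
    "latitude": 51.5,
    "longitude": -0.12,
    "parliamentary_constituency": "Example Constituency",
    "region": "London"
  }
}
""")

RESULT_FIELDS = [
    "country",
    "european_electoral_region",
    "latitude",
    "longitude",
    "parliamentary_constituency",
    "region",
]
STRING_FIELDS = {"latitude", "longitude"}  # cast to string, as in selectExpr


def flatten(payload):
    """Project the nested result into one flat row (hypothetical helper)."""
    result = payload.get("result") or {}
    row = {"status": str(payload["status"])}
    for name in RESULT_FIELDS:
        value = result.get(name)
        if name in STRING_FIELDS and value is not None:
            value = str(value)
        row[name] = value
    # Placeholder validation columns, left empty as in the query above.
    row["vld_status"] = ""
    row["vld_status_reason"] = ""
    return row


row = flatten(sample)
print(row)
```

Each key of the resulting row corresponds to one column of df_temp.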

Finally write the data into a table

Once the columns are selected, append the data to a Delta table. The variable table_name must already hold the name of the target table.

df_temp.write.format("delta").mode("append").saveAsTable(table_name)

The result looks as follows:

[Figure: REST call results in Databricks]

Here is the full snippet of code
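
Assembled from the steps above, the whole flow might look like the following sketch. The function name load_random_postcode, its parameters, and the status check are assumptions added for illustration. The headers argument defaults to empty on the assumption that the public endpoint responds without the cookie; pass the headers shown earlier if it does not. The sketch also assumes a cluster where spark.sparkContext is available.

```python
import http.client
import json


def load_random_postcode(spark, table_name, conn=None, headers=None):
    """Fetch one random postcode and append it to a Delta table.

    Hypothetical wrapper around the steps in this article. spark is the
    SparkSession that Databricks notebooks provide; table_name is the
    target table.
    """
    if conn is None:
        conn = http.client.HTTPSConnection("api.postcodes.io")

    conn.request("GET", "/random/postcodes", headers=headers or {})
    res = conn.getresponse()
    body = res.read().decode("utf-8")
    if res.status != 200:
        raise RuntimeError(f"Request failed with HTTP {res.status}")

    # Validate the JSON and re-serialize it as one single-line document.
    jsondata = json.dumps(json.loads(body))

    # spark.read.json accepts an RDD of JSON strings, one document per element.
    df = spark.read.json(spark.sparkContext.parallelize([jsondata]))

    df_temp = df.selectExpr(
        "string(status) as status",
        "result['country'] as country",
        "result['european_electoral_region'] as european_electoral_region",
        "string(result['latitude']) as latitude",
        "string(result['longitude']) as longitude",
        "result['parliamentary_constituency'] as parliamentary_constituency",
        "result['region'] as region",
        "'' as vld_status",
        "'' as vld_status_reason",
    )

    df_temp.write.format("delta").mode("append").saveAsTable(table_name)
    return df_temp
```

In a notebook this would be called as load_random_postcode(spark, "postcodes_demo"), where postcodes_demo is a placeholder table name.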

I hope you found this walkthrough of calling a REST API from Databricks and storing the result in a Delta table useful.

Curious to learn more about graph databases, Neo4j, Python, Databricks, Spark, and data engineering? Please follow this series of learning articles, follow Ramesh Nelluri, and subscribe on Medium.
