How To Call REST API & Store Data in Databricks
This article covers the concepts behind a REST API and how to call one from Databricks. We will also learn to process the JSON structures received from a REST service and store the data in Databricks (Delta tables).
Databricks
Databricks is a popular cloud-based computing platform for data science and analytics. It simplifies processing large volumes of data from files, streams, databases, and REST services.
One of the main reasons for its success is that it makes it easy to work with large amounts of data stored in disparate formats, databases, and other storage systems.
What is a REST service
A REST API is an application programming interface that facilitates interaction between computer systems on the Internet. Requests and responses are typically serialized as JSON or XML, and API descriptions are often written in JSON or YAML.
REST APIs are widely used by programmers to build software applications. The architecture of a RESTful system is based on resources, which can include, but are not limited to, data records in a database.
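To make the idea of a resource concrete, a REST call can be thought of as an HTTP method applied to a URL that identifies a resource. The sketch below only builds a request object with Python's standard library and inspects its parts; nothing is sent over the network. The host and path are the postcodes.io endpoint used later in this article.

```python
from urllib.request import Request

# A REST call pairs an HTTP method with a URL that identifies a resource.
# Building the Request object does not open a connection.
req = Request("https://api.postcodes.io/random/postcodes", method="GET")

method = req.get_method()   # HTTP verb applied to the resource
host = req.host             # server that owns the resource
resource = req.selector     # path identifying the resource on that server

print(method, host, resource)
```

The same three pieces (method, host, path) reappear in the connection and request steps of the walkthrough.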
Credits to postcodes.io
Postcodes.io is an open-source project maintained by Ideal Postcodes. It is a free resource that allows developers to search, reverse geocode, and extract UK postcodes and associated data.
In this example, I use postcodes.io to retrieve a random postcode.
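For orientation, the response handled in the rest of this article is a JSON object with a top-level status and a nested result object. The sketch below parses a trimmed-down placeholder payload: the field names are the ones selected later in this article, while the values are invented for illustration and are not real API output.

```python
import json

# Placeholder payload: field names follow the columns selected later in
# this article; the values are made up for illustration only.
sample_body = """
{
  "status": 200,
  "result": {
    "country": "Example Country",
    "european_electoral_region": "Example Region",
    "latitude": 51.5,
    "longitude": -0.12,
    "parliamentary_constituency": "Example Constituency",
    "region": "Example Region"
  }
}
"""

parsed = json.loads(sample_body)
top_level_keys = sorted(parsed)          # keys of the outer object
result_keys = sorted(parsed["result"])   # keys of the nested object

print(top_level_keys)
print(result_keys)
```

Keeping this two-level shape in mind makes the later column selection easier to follow.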
Code snippet explanation
This code is executed in a Databricks notebook with Python.
Import required Python packages
First we import two Python standard-library modules: http.client (for making the HTTP request) and json (for processing the JSON received from the REST call).
import http.client
import json
Establish connection to postcodes.io
conn = http.client.HTTPSConnection("api.postcodes.io")
Set up input parameters for the REST call
- payload (I am making a call to retrieve a random postcode, hence the payload is empty)
- headers (provides the cookie sent with the request to postcodes.io)
payload = ''
headers = {
    'Cookie': '__cfduid=d2e270bea97599e2fbde210bf483fcd491615195032'
}
Make the call with the API endpoint
Here we make a single GET call for a random postcode.
conn.request("GET", "/random/postcodes", payload, headers)
Receive response and store as JSON
In this step, read the response body and decode it from UTF-8 bytes into a string.
res = conn.getresponse()
data = res.read().decode("utf-8")
# data is already a JSON string; this round trip returns the same string
jsondata = json.loads(json.dumps(data))
Convert the JSON data into a dataframe
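This is where many go wrong: the JSON string should be parallelized into an RDD before it is passed to spark.read.json, because a plain string argument is interpreted as a file path.
df = spark.read.json(sc.parallelize([jsondata]))
Select only required columns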
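The previous step yields a dataframe with two columns, status and result. The result column is itself a nested structure, so the next statement pulls the required fields out of it, casts status, latitude, and longitude to strings, and adds two empty columns, vld_status and vld_status_reason.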
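The same extraction can be sketched in plain Python to make the mapping explicit before the Spark statement. The helper name flatten_postcode and the sample record are hypothetical and only mirror the column list used in this article; they are not part of the original notebook.

```python
def flatten_postcode(record):
    """Mirror the column selection in plain Python (illustrative helper)."""
    result = record.get("result") or {}

    def as_text(value):
        # Keep missing values as None rather than the string "None".
        return None if value is None else str(value)

    return {
        "status": as_text(record.get("status")),
        "country": result.get("country"),
        "european_electoral_region": result.get("european_electoral_region"),
        "latitude": as_text(result.get("latitude")),
        "longitude": as_text(result.get("longitude")),
        "parliamentary_constituency": result.get("parliamentary_constituency"),
        "region": result.get("region"),
        "vld_status": "",          # empty placeholder column
        "vld_status_reason": "",   # empty placeholder column
    }


# Made-up record with only a few fields present.
record = {
    "status": 200,
    "result": {"country": "Example Country", "latitude": 51.5, "longitude": -0.12},
}
row = flatten_postcode(record)
print(row)
```

Each key in the returned dictionary corresponds to one expression in the Spark column selection.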
df_temp = df.selectExpr(
    "string(status) as status",
    "result['country'] as country",
    "result['european_electoral_region'] as european_electoral_region",
    "string(result['latitude']) as latitude",
    "string(result['longitude']) as longitude",
    "result['parliamentary_constituency'] as parliamentary_constituency",
    "result['region'] as region",
    "'' as vld_status",
    "'' as vld_status_reason",
)
Finally write the data into a table
Once the columns are selected, append the data to a Delta table.
# table_name must already hold the name of the target table
df_temp.write.format("delta").mode("append").saveAsTable(table_name)
The result looks as follows

Here is the full snippet of code
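The following is a sketch assembled from the steps above, not a copy of the original notebook cell. It makes several assumptions: the function names are hypothetical, the table name "postcodes_raw" is a placeholder, the Cookie header is omitted on the assumption that the public endpoint responds without it, and the Spark part expects the spark session that a Databricks notebook provides.

```python
import http.client
import json


def fetch_random_postcode(host="api.postcodes.io", path="/random/postcodes"):
    """Call the endpoint and return the response body as a JSON string."""
    conn = http.client.HTTPSConnection(host, timeout=10)
    try:
        conn.request("GET", path)
        res = conn.getresponse()
        body = res.read().decode("utf-8")
    finally:
        conn.close()
    if res.status != 200:
        raise RuntimeError(f"Unexpected HTTP status {res.status}")
    return body


def check_body(body):
    """Return the body unchanged if it is JSON with a nested result object."""
    parsed = json.loads(body)
    if not isinstance(parsed, dict) or not isinstance(parsed.get("result"), dict):
        raise ValueError("response has no 'result' object")
    return body


def save_postcode(spark, body, table_name):
    """Load one JSON string into a dataframe and append it to a Delta table."""
    # spark.read.json expects an RDD of JSON strings, hence parallelize.
    df = spark.read.json(spark.sparkContext.parallelize([body]))
    df_temp = df.selectExpr(
        "string(status) as status",
        "result['country'] as country",
        "result['european_electoral_region'] as european_electoral_region",
        "string(result['latitude']) as latitude",
        "string(result['longitude']) as longitude",
        "result['parliamentary_constituency'] as parliamentary_constituency",
        "result['region'] as region",
        "'' as vld_status",
        "'' as vld_status_reason",
    )
    df_temp.write.format("delta").mode("append").saveAsTable(table_name)


# In a Databricks notebook, where `spark` is predefined (placeholder table name):
# save_postcode(spark, check_body(fetch_random_postcode()), "postcodes_raw")
```

Splitting the fetch, the validation, and the write into separate functions keeps the network call and the Spark write independently testable; if your environment does need the cookie, pass a headers dictionary to conn.request as in the walkthrough.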





