Presto querying data in Azure Blob Storage and Azure Data Lake Store

Summary

The author demonstrates a POC that uses Presto to query data in Azure Blob Storage and Azure Data Lake Store through the hive-hadoop2 connector and a Hive metastore. The setup runs in Docker containers on a local machine or a VM in Azure, and it is accompanied by a video walkthrough and source code on GitHub.

Abstract

In a recent project, the author implemented a proof of concept (POC) for querying data stored in Azure Blob Storage and Azure Data Lake Store using a single-node Presto setup. The Presto deployment (version 0.167 or 0.178) is configured with the hive-hadoop2 connector and a few additional JARs, and relies on a Hive metastore service to manage the table metadata. Both Presto and Hive run in Docker containers orchestrated by docker-compose. The author opened a shell in the running containers to test reading from and writing to tables backed by the two Azure storage services. A short video guide complements the documentation, and the complete source code is available on GitHub for anyone who wants to explore or replicate the setup. The author invites feedback and questions on this work via Twitter.

Opinions

The author demonstrates a preference for using open-source tools like Presto and Docker for creating a flexible and scalable data processing environment.
The author highlights the utility of integrating Presto with Azure storage services for querying big data efficiently.
The author values community engagement and knowledge sharing, as indicated by the public availability of the video walkthrough, source code, and active solicitation for feedback and questions.
By providing a step-by-step guide and resources, the author exhibits a commitment to aiding others in implementing similar solutions.
The choice to use both local machine and Azure VM options indicates an understanding of varying user environments and a desire to make the POC broadly accessible.
The author's reference to using a Hive metastore shows an appreciation for metadata management in distributed data processing contexts.

Presto querying data in Azure Blob Storage and Azure Data Lake Store

Recently, I created a simple POC of a single-node Presto querying data in Azure Blob Storage (WASB) and Azure Data Lake Store (ADLS).

In my example, Presto (version 0.167 or 0.178) accesses these data stores through its hive-hadoop2 connector (with a few additional JARs) and needs a Hive metastore service to store the table metadata (i.e., table definition, location, and storage format). I therefore create Presto and Hive containers and run them with docker-compose on my local machine (or on a VM in Azure). Once the containers are running, I open a bash shell in each one (i.e., docker exec -it container_id bash) and try reading from and writing to tables backed by Azure Blob Storage and Azure Data Lake Store.
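To make "tables backed by Azure Blob Storage and Azure Data Lake Store" more concrete, here is a minimal Python sketch of the table locations the metastore records. It is an illustration under stated assumptions, not the code from the POC: the helper functions, the account, container, and table names, and the ORC storage format are all hypothetical. Only the wasb:// and adl:// URI schemes and the general Hive CREATE EXTERNAL TABLE form are standard.

```python
# Hypothetical helpers: they only build strings and do not contact Azure, Hive, or Presto.

def wasb_uri(container: str, account: str, path: str = "") -> str:
    """Build a wasb:// URI for a path inside an Azure Blob Storage container."""
    return f"wasb://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def adls_uri(account: str, path: str = "") -> str:
    """Build an adl:// URI for a path inside an Azure Data Lake Store account."""
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


def external_table_ddl(table, columns, location, storage_format="ORC"):
    """Render a Hive DDL statement for an external table stored at `location`.

    `columns` is a list of (name, hive_type) pairs. The statement captures the
    three pieces of metadata the metastore keeps: definition, storage format,
    and location.
    """
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols}) "
        f"STORED AS {storage_format} "
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Assumed example names; replace with a real storage account and container.
    columns = [("id", "bigint"), ("name", "string")]
    print(external_table_ddl("events_wasb", columns, wasb_uri("data", "myaccount", "events")))
    print(external_table_ddl("events_adls", columns, adls_uri("mylake", "events")))
```

A statement like the ones printed here would typically be run from the Hive container, after which the table can be queried from the Presto container through the Hive catalog. The storage credentials themselves live in the Hadoop configuration and are deliberately left out of this sketch.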


Video Walkthrough

Diagram