Summary
The author demonstrates a proof of concept (POC) that uses Presto to query data in Azure Blob Storage and Azure Data Lake Store through the hive-hadoop2 connector and a Hive metastore. The setup runs in Docker containers on a local machine or on a VM in Azure, and the author provides a video walkthrough and the source code on GitHub.
Abstract
In a recent project, the author implemented a proof of concept (POC) for querying data stored in Azure Blob Storage and Azure Data Lake Store using a single-node Presto setup. The Presto deployment, which can use version 0.167 or 0.178, is configured with the hive-hadoop2 connector and additional JARs, and relies on the Hive metastore service to manage table metadata. The POC runs in Docker containers for both Presto and Hive, orchestrated by docker-compose. The author tested read and write operations on tables backed by the Azure storage services from inside the running containers. A short video guide complements the documentation, and the complete source code is available on GitHub for anyone who wants to explore or replicate the setup. The author invites feedback and questions on this work via Twitter.
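To make the connector and metastore relationship concrete, the sketch below renders a minimal Hive catalog properties file of the kind Presto reads for the hive-hadoop2 connector. It is an illustration under stated assumptions, not the author's actual configuration: the metastore host name `hive-metastore`, the port 9083, and the `core-site.xml` path are hypothetical, and the helper function names are invented for this example. The sketch assumes the storage account key is supplied through a Hadoop configuration file referenced by `hive.config.resources`; the repository may wire credentials differently.

```python
def render_hive_catalog(
    metastore_host: str,
    metastore_port: int = 9083,  # assumed Thrift port for the metastore
    config_resources: str = "/etc/hadoop/core-site.xml",  # hypothetical path
) -> str:
    """Return the text of a Presto Hive catalog properties file."""
    lines = [
        # Connector named in the summary above.
        "connector.name=hive-hadoop2",
        # Presto asks the Hive metastore service for table metadata.
        f"hive.metastore.uri=thrift://{metastore_host}:{metastore_port}",
        # Hadoop config file where storage credentials are assumed to live.
        f"hive.config.resources={config_resources}",
    ]
    return "\n".join(lines) + "\n"


def wasb_account_key_property(account: str) -> str:
    """Name of the Hadoop property that holds a Blob Storage account key."""
    return f"fs.azure.account.key.{account}.blob.core.windows.net"


if __name__ == "__main__":
    # "hive-metastore" is a hypothetical docker-compose service name.
    print(render_hive_catalog("hive-metastore"))
    print(wasb_account_key_property("myaccount"))
```

In a docker-compose setup, the metastore host would typically be the service name of the Hive container, since containers on the same compose network can reach each other by service name.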
Opinions
- The author demonstrates a preference for using open-source tools like Presto and Docker for creating a flexible and scalable data processing environment.
- The author highlights the utility of integrating Presto with Azure storage services for querying big data efficiently.
- The author values community engagement and knowledge sharing, as indicated by the publicly available video walkthrough and source code and by the open invitation for feedback and questions.
- By providing a step-by-step guide and resources, the author exhibits a commitment to aiding others in implementing similar solutions.
- Supporting both a local machine and an Azure VM indicates an understanding of varying user environments and a desire to make the POC broadly accessible.
- The author's reference to using a Hive metastore shows an appreciation for metadata management in distributed data processing contexts.