Design Recommendations and Guidelines for Adopting Unity Catalog in Your Existing Azure Data Lakehouse + FAQ
In this blog post, I explain how to adopt Unity Catalog within your current data lakehouse and share my recommended best practices. These practices are designed to help you take full advantage of Unity Catalog’s new features without disrupting your existing data architecture. I also outline the recommended design patterns for new Data Lakehouse implementations using managed tables, and close with an FAQ section for clarity.
Description of Existing Data Lakehouse Architecture: Legacy Method Using Mount Points
Over the past few years, I have worked with various clients in the Azure space who have embraced and implemented the Medallion architecture for their enterprise Data Lake. This architecture typically involves having at least three different Azure Data Lake storage containers (bronze, silver, gold) to store various stages of data refinement. Databricks is commonly used in conjunction with a data lake to ingest, process, and write data back to the data lake containers or zones. In this architecture, it is common to create Databricks mount points against the different data lake zones.
To be specific, when creating mount points on Azure Data Lake Storage Gen2, the following steps are required:
- Create a Service Principal in Azure Active Directory, which acts as an application identity used to facilitate authentication between the Databricks application and Azure Data Lake Storage.
- Once the service principal is correctly configured, it is used to set up the mount points within the Databricks workspace. These mount points have a file path format such as ‘/mnt/bronzecontainername’.
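As a rough sketch of the two steps above, a mount is typically created from a notebook with `dbutils.fs.mount`, passing OAuth settings for the service principal. The helper name `build_mount_args`, the storage account name `mydatalake`, and the bracketed credential values are placeholders I made up for illustration. `dbutils` only exists inside a Databricks notebook, so that call is left as a comment.

```python
def build_mount_args(container, storage_account, client_id, client_secret, tenant_id):
    """Build the arguments for mounting one ADLS Gen2 container with a service principal."""
    source = f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
    mount_point = f"/mnt/{container}"
    # OAuth client-credentials settings used by the ABFS driver
    extra_configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
    return source, mount_point, extra_configs


# Placeholder values (assumptions); in practice the secret should come from a secret scope.
source, mount_point, extra_configs = build_mount_args(
    container="bronzecontainername",
    storage_account="mydatalake",
    client_id="<application-id>",
    client_secret="<client-secret>",
    tenant_id="<tenant-id>",
)
print(mount_point)

# Inside a Databricks notebook the mount itself would be created with:
# dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=extra_configs)
```

You would repeat this once per zone (bronze, silver, gold), and again in every workspace that needs access, which is part of what makes mount points tedious to manage.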
With this configuration in place, you can easily read and write transformed data into the data lake using Databricks. If you need a schema over files that already exist in your data lake, you can create external tables and query them like regular SQL tables. Note that the table metadata is stored in the Hive Metastore catalog within your workspace, and that the Hive Metastore is tied to that specific workspace.
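To make the legacy pattern concrete, here is a minimal sketch of registering an external table in the Hive Metastore over Delta files that sit behind a mount point. The helper `external_table_ddl`, the database and table names, and the folder path are hypothetical. The function only builds the SQL text, and in a notebook you would pass it to `spark.sql`.

```python
def external_table_ddl(database, table, path):
    """Return DDL that registers existing Delta files at `path` as an external table."""
    return (
        f"CREATE TABLE IF NOT EXISTS {database}.{table}\n"
        "USING DELTA\n"
        f"LOCATION '{path}'"
    )


# Hypothetical names and path, for illustration only.
ddl = external_table_ddl("gold", "sales_orders", "/mnt/goldcontainername/sales/orders")
print(ddl)

# In a Databricks notebook: spark.sql(ddl)
```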
Enabling Unity Catalog in Your Existing Data Architecture Pattern
Like any new technology, Unity Catalog can be challenging for early adopters to navigate. It’s not always straightforward: Databricks can offer general recommendations, but these may not be sufficient for your particular enterprise architecture. In this article, I aim to demystify and address your concerns, allowing you to safely adopt Unity Catalog and reap its benefits without significant disruption.
Key Terms to Understand in Unity Catalog: External Table vs. Managed Table
Understanding the difference between a managed table and an external table is crucial for this article. Please refer to the FAQ section at the end.
Steps to Leverage Unity Catalog
- Start by enabling Unity Catalog in your Databricks Regional Account. Enabling Unity Catalog will not alter any existing setups or affect legacy file management or access to your legacy Hive Metastore or mount points. It merely adds new features.
- Next, assess your data management needs and how you wish to leverage Unity Catalog. The answers to these questions will guide you on any necessary changes to make the most of Unity Catalog’s features.
Common Unity Catalog Use Cases and Architectural Recommendations
1. Data as a Service: If you have numerous table schemas exposed in your Hive Metastore, sourced from your Gold data lake zone, and you want to unify data access across various Databricks workspaces for business intelligence analytics and reporting, you can avoid creating multiple mount points on different workspaces. This simplifies data access and gives you finer-grained control for specific users, at both the table and row levels.
- Create a Unity Catalog Metastore for your enterprise account and enable it on the new workspace.
- Establish an external location over the Gold zone storage account to make data available to Unity Catalog.
- Create a catalog on that external location, allowing you to create new tables with new data or add schemas to existing tables in the Gold zone storage account.
- Grant access to the databases under the catalog and tables for various users or user groups as needed.
- Restrict direct access to the data lake storage account files to ensure data is accessed exclusively through Unity Catalog.
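The steps above could be scripted roughly as follows. This is a sketch under several assumptions: a storage credential (here called `gold_credential`) already exists, the Gold zone URL points at a placeholder account, the catalog’s managed location is a `managed/` subfolder so it does not overlap with the external table paths, and all object names are hypothetical. The function only assembles the SQL statements, which you would then run one by one with `spark.sql` or in the SQL editor.

```python
def data_as_a_service_sql(catalog, schema, table, table_path, group, gold_url, credential):
    """Return the ordered SQL statements for exposing a Gold table through Unity Catalog."""
    full_name = f"{catalog}.{schema}.{table}"
    return [
        # Make the Gold zone available to Unity Catalog
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS gold_zone URL '{gold_url}' "
        f"WITH (STORAGE CREDENTIAL {credential})",
        # Create the catalog inside that external location
        f"CREATE CATALOG IF NOT EXISTS {catalog} MANAGED LOCATION '{gold_url}managed/{catalog}'",
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}",
        # Register existing Delta files as an external table
        f"CREATE TABLE IF NOT EXISTS {full_name} USING DELTA LOCATION '{gold_url}{table_path}'",
        # Grant read access to a user group
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{group}`",
        f"GRANT SELECT ON TABLE {full_name} TO `{group}`",
    ]


# All values below are placeholders chosen for illustration.
statements = data_as_a_service_sql(
    catalog="prod_sales",
    schema="reporting",
    table="orders",
    table_path="sales/orders",
    group="bi_analysts",
    gold_url="abfss://gold@mydatalake.dfs.core.windows.net/",
    credential="gold_credential",
)
for statement in statements:
    print(statement + ";")
```

The last step in the list, restricting direct access to the storage account, happens on the Azure side (for example through role assignments and networking rules) rather than in SQL, so it is not shown here.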
2. Migrating from Mount Points to External Locations in ELT Pipelines: To transition from using mount points to external locations while maintaining external file storage and avoiding managed delta tables:
- Use the external location path to access the delta files instead of mount points (e.g., “abfss://<container>@<storage-account>.dfs.core.windows.net/”).
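If your pipelines have many hard-coded mount paths, a small translation helper can ease the switch. The sketch below assumes you keep a mapping from each mount point to its container and storage account. The function name `mount_path_to_abfss` and the account name `mydatalake` are made up for this example.

```python
def mount_path_to_abfss(path, mounts):
    """Translate a /mnt/... path to its abfss:// equivalent.

    `mounts` maps a mount point to a (container, storage_account) tuple.
    """
    for mount_point, (container, account) in mounts.items():
        prefix = mount_point.rstrip("/")
        # Match the mount point itself or anything below it
        if path == prefix or path.startswith(prefix + "/"):
            relative = path[len(prefix):].lstrip("/")
            return f"abfss://{container}@{account}.dfs.core.windows.net/{relative}"
    raise ValueError(f"No known mount point for path: {path}")


# Hypothetical mapping of legacy mount points to their storage locations.
MOUNTS = {
    "/mnt/bronze": ("bronze", "mydatalake"),
    "/mnt/silver": ("silver", "mydatalake"),
}

print(mount_path_to_abfss("/mnt/bronze/sales/orders", MOUNTS))
```

The returned path can be used anywhere the mount path was used before, provided the external location covering it has been created and you have been granted access to it.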
3. Data Warehouse Modernization with Managed Tables: Suppose you want to create separate catalogs for your bronze, silver, and gold databases, with managed tables stored in distinct Azure Data Lake containers. For instance, you’d like the data of bronze managed tables to be stored in a bronze ADLS container.
- Create a catalog in the UI or via a script, and specify an external location linked to the desired storage location.
- Any managed table created within this catalog will be saved in the external location set during catalog creation.
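In script form, this pattern might look like the sketch below, which generates one `CREATE CATALOG ... MANAGED LOCATION` statement per medallion layer. It assumes each container is already covered by an external location you can use, and the environment prefix, storage account name, and helper name `medallion_catalog_sql` are placeholders of my own.

```python
def medallion_catalog_sql(env, storage_account, layers=("bronze", "silver", "gold")):
    """Return one CREATE CATALOG statement per layer, each with its own managed location."""
    statements = []
    for layer in layers:
        url = f"abfss://{layer}@{storage_account}.dfs.core.windows.net/"
        statements.append(
            f"CREATE CATALOG IF NOT EXISTS {env}_{layer} MANAGED LOCATION '{url}'"
        )
    return statements


# Placeholder environment and storage account names.
catalog_statements = medallion_catalog_sql("dev", "mydatalake")
for statement in catalog_statements:
    print(statement + ";")

# A table created afterwards without a LOCATION clause is a managed table,
# so its files land under the catalog's managed location, for example:
# CREATE TABLE dev_bronze.sales.orders (order_id BIGINT, amount DOUBLE);
```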
FAQ
1. What is the naming convention to use for objects in Unity Catalog? Be mindful of the naming convention when creating a catalog. Ideally I would use a four-level naming format: Environment, Database, Schema, Table/View. However, Unity Catalog only allows three levels: Catalog (database), Schema, and Table/View. The workaround is to concatenate the environment name with the database name in the catalog name. For example, you can name a catalog “dev_sales”.
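A tiny helper can keep this convention consistent across scripts. The function names below are hypothetical, and lowercasing the result is my own assumption rather than a Unity Catalog requirement.

```python
def catalog_name(env, database):
    """Fold the environment into the catalog name, e.g. dev + sales -> dev_sales."""
    return f"{env}_{database}".lower()


def qualified_name(env, database, schema, table):
    """Return the three-level Unity Catalog name: catalog.schema.table."""
    return f"{catalog_name(env, database)}.{schema}.{table}".lower()


print(catalog_name("dev", "sales"))
print(qualified_name("Prod", "Sales", "reporting", "orders"))
```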
2. What is the cost of Unity Catalog? It is free.
3. Can you browse the schema of a Unity catalog without a cluster? Yes.
4. Can you browse the schema of the Hive Metastore catalog without a cluster? No.
5. What is the difference between a managed table and an external table?
Managed Tables
In Databricks, a managed table implies that the delta table data is stored in the data lake, but its location is controlled by Databricks to prevent user manipulation. If you drop a managed table, both the data and metadata are deleted.
Think of managed tables as delegating the storage of table data and files to Databricks. You lose the ability to configure storage paths at a fine-grained level to suit your custom requirements. It’s akin to using an on-premises data warehouse where you can’t access the underlying files directly; you can only interact with the data using SELECT statements. The advantage of managed tables is that Databricks handles many optimizations and much of the performance tuning automatically on your behalf. With Unity Catalog, you can still specify an external location where all managed table data for a catalog will reside. Databricks is encouraging users to move away from managing files and folders externally in the data lake.
External Tables
For external tables, the data is stored in a location of your choice, and you have full accessibility to see the underlying folder structure and storage path. If you drop an external table, only the metadata of the table is removed from Unity Catalog, while the data remains intact.
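In DDL terms, the difference between the two comes down to whether you supply a `LOCATION` clause. The sketch below builds both variants; the helper `create_table_ddl`, the table names, the column list, and the storage path are all hypothetical.

```python
def create_table_ddl(full_name, columns, location=None):
    """Build CREATE TABLE DDL: managed when no location is given, external otherwise."""
    ddl = f"CREATE TABLE IF NOT EXISTS {full_name} ({columns}) USING DELTA"
    if location is not None:
        # An explicit path makes this an external table
        ddl += f" LOCATION '{location}'"
    return ddl


COLUMNS = "order_id BIGINT, amount DOUBLE"

# Managed: Databricks decides where the files live; DROP TABLE removes data and metadata.
managed = create_table_ddl("dev_sales.reporting.orders", COLUMNS)

# External: you choose the path; DROP TABLE removes only the metadata.
external = create_table_ddl(
    "dev_sales.reporting.orders_ext",
    COLUMNS,
    "abfss://gold@mydatalake.dfs.core.windows.net/sales/orders",
)

print(managed)
print(external)
```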
6. Can you create an external table in a catalog using a different storage container than the one (external location) assigned when the catalog was created? Yes, you can create an external table from any storage location in your data lake, but it is not recommended.
Further readings
This blog post goes into more depth on all the features of Unity Catalog and is very good: https://readmedium.com/databricks-unity-catalog-all-you-need-to-know-3add40486547#bf6b
About Me
I am passionate about empowering, educating, and encouraging individuals pursuing a career in data engineering. I am currently a Senior Data Engineer at Capgemini, specializing in Azure Databricks.
LinkedIn: https://www.linkedin.com/in/nobieyisi/
