Summary

The web content provides an in-depth explanation of Apache Hive's architecture, detailing its four main components: Hadoop core components (HDFS and MapReduce), Metastore, Driver, and Hive Clients, and how they interact to execute Hive queries.

Abstract

Apache Hive is a pivotal tool for ETL and data warehousing within the Hadoop ecosystem, facilitating data summarization and analysis. It leverages a table structure similar to relational databases, with data organization enhanced through partitioning and bucketing. The architecture of Hive is dissected into four key components: Hadoop's HDFS for data storage, MapReduce for processing queries, Metastore for housing metadata, and the Driver for query compilation and execution planning. Additionally, various Hive clients serve as interfaces for query submission. The article emphasizes the Metastore's significance, holding crucial metadata, and illustrates how Hive integrates with these components to efficiently manage and query large datasets.

Opinions

The author emphasizes the importance of the Metastore in Hive's architecture, considering it a "crucial part" for storing metadata.
The use of HDFS and MapReduce as foundational components underscores Hive's reliance on Hadoop's core functionalities.
The article suggests that access to the Metastore is typically restricted, implying a need for administrative control and security.
By mentioning the conversion of HiveQL queries into MapReduce jobs, the author highlights Hive's ability to translate high-level queries into executable Java code.
The inclusion of the Driver component in the architecture reflects the complexity of query processing and the need for an intermediary to generate execution plans.
The author provides a practical perspective by referencing real-world configurations, such as those found in hive-site.xml and core-site.xml, indicating the importance of proper setup for Hive's operation.
The mention of Hive clients, including terminal interfaces like hive CLI and beeline, as well as web interfaces like Hue and Ambari, suggests a preference for user-friendly query submission methods.
The article concludes with an invitation for reader engagement and contribution, indicating a community-driven approach to knowledge sharing and continuous learning in the field of big data.

Hive Architecture in Depth

Apache Hive is an ETL and Data warehousing tool built on top of Hadoop for data summarization, analysis and querying of large data systems in open source Hadoop platform. The tables in Hive are similar to tables in a relational database, and data units can be organized from larger to more granular units with the help of Partitioning and Bucketing.

As part of this blog, I will be explaining as to how the architecture works on executing a hive query. Details such as the execution of queries, format, location and schema of hive table inside the Metastore etc.

There are 4 main components as part of Hive Architecture.

Hadoop core components(Hdfs, MapReduce)
Metastore
Driver
Hive Clients

Let’s start off with each of the components individually.

Hadoop core components:

i) HDFS: When we load the data into a Hive Table it internally stores the data in HDFS path i.e by default in hive warehouse directory.

The hive default warehouse location can be found at

We can create a hive table and load the data into it as shown in below images.

ii) MapReduce: When we Run the below query, it will run a Map Reduce job by converting or compiling the query into a java class file, building a jar and execute this jar file.

2. Metastore: is a namespace for tables. This is a crucial part for the hive as all the metadata information related to the hive such as details related to the table, columns, partitions, location is present as part of it. Usually, the Metastore is available as part of the Relational databases eg: MySql

You can check the DataBase configuration via hive-site.xml

The Metastore details can be found as shown below.

Metastore has an overall 51 tables describing various properties related to the table but out of these 51 the below 3 tables are the ones which provide majority of the information, that in-turn helps hive infer the schema properties to execute the hive commands on a table.

i) TBLS — Store all table information (Table name, Owner, Type of Table(Managed|External)

ii) DBS — Database information (Location of database, Database name, Owner)

iii) COLUMNS_V2 — column name and the datatype

Note: In real-time the access to metastore is restricted to the Admin and specific users.

3. Driver: The component that parses the query, does semantic analysis on the different query blocks and query expressions and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore. The execution plan created by the compiler is a DAG of stages.

A bunch of jar files that are part of hive package help in converting these HiveQL queries into equivalent MapReduce jobs(java) and execute them on MapReduce.

To check if the hive can talk to the appropriate cluster. i.e for hive to interact, query or execute with the existing cluster. You can check details under core-site.xml.

You can verify the same from hive as well.

4. Hive Clients: It is the interface through which we can submit the hive queries. eg: hive CLI, beeline are some of the terminal interfaces, we can also use the Web-interface like Hue, Ambari to perform the same.

On connecting to Hive via CLI or Web-Interface it establishes a connection to Metastore as well.

The commands and queries related to this post are added as part of my GIT repository.

If you enjoyed reading it, you can click the heart ❤️ below and let others know about it. If you have got anything to add please feel free to leave a response 💬💭.