Hive Tables
A Hive table is like a container that holds data in the form of rows and columns. Each table has:
- Columns (which define the type of data stored, like numbers, text, or dates).
- Rows (which are the actual records or pieces of data).
In Hive, a table does not hold the data itself the way a table in a traditional database does. Instead:
- The data is stored as files in HDFS (Hadoop Distributed File System).
- The table definition (schema, column names, data types, file format, and location) is metadata, which Hive keeps in its metastore.
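To look at the metadata Hive keeps for a table, you can ask Hive to describe it. The sketch below assumes a table named sales_data already exists; the name is only a placeholder, and the exact output layout varies between Hive versions.

```sql
-- Show column names and data types only
DESCRIBE sales_data;

-- Show the full metadata: columns, table type, HDFS location,
-- input/output formats, and table parameters
DESCRIBE FORMATTED sales_data;
```

In the DESCRIBE FORMATTED output, the Location row points at the HDFS directory that holds the data files, and the Table Type row reads MANAGED_TABLE or EXTERNAL_TABLE.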
Why Do We Use Tables in Hive?
- They help you organize large datasets in a structured way.
- They allow you to run SQL-style queries (HiveQL) to filter, group, join, and analyze big data easily.
- They let you define how the data is stored (file format, delimiter, partitions).
- They support batch processing for large-scale reporting and analytics.
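As an illustration of the points above, the following sketch defines a partitioned table with an explicit delimiter and file format, then runs a filter-and-group query on it. The table web_orders, its columns, and the date value are hypothetical names chosen for this example.

```sql
-- Hypothetical table: Hive keeps one HDFS subdirectory per order_date value
CREATE TABLE web_orders (
  order_id INT,
  customer STRING,
  amount   DOUBLE
)
PARTITIONED BY (order_date STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Total spend per customer for a single day; the filter on the
-- partition column lets Hive read only that partition's files
SELECT customer, SUM(amount) AS total_amount
FROM web_orders
WHERE order_date = '2024-01-15'
GROUP BY customer;
```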
Types of Tables
Hive offers two main types of tables:
Managed Tables (Internal Tables) -
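In a managed table, Hive controls both the table metadata and the actual data files. When you drop the table, Hive removes the metadata and deletes the data files from HDFS. (If HDFS trash is enabled, the files are moved to the trash first unless you drop the table with PURGE.)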
Key Points -
- Data is deleted if you drop the table.
- Hive manages everything for you.
Example:
-- No EXTERNAL keyword, so Hive creates a managed table
CREATE TABLE sales_data (
  order_id INT,
  customer STRING,
  amount   DOUBLE
);
This is a managed table because the statement does not use the EXTERNAL keyword. With no LOCATION clause, Hive stores the data under its default warehouse location (e.g., /user/hive/warehouse/sales_data).
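The effect of dropping a managed table can be sketched as follows, assuming the sales_data table above and a Hive version that supports INSERT ... VALUES (0.14 or later). The sample row is made up for illustration.

```sql
-- Add one made-up row; Hive writes it as a file under the table's directory
INSERT INTO TABLE sales_data VALUES (1, 'Alice', 250.00);

-- Confirm the row is readable
SELECT * FROM sales_data;

-- Drop the managed table: Hive removes the metadata and the data files
DROP TABLE sales_data;
```

After the DROP TABLE, the table's directory is no longer available in the warehouse location, so only use a managed table when Hive is meant to own the data.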
External Tables -
In an external table, Hive only manages the table metadata, but the actual data stays where it is in HDFS. If you drop the table, the data files are not deleted.
Key Points -
- Data is safe even if the table is dropped.
- Useful when the same data is shared across multiple tools or teams.
Example:
-- EXTERNAL: Hive tracks only the metadata and leaves the files alone
CREATE EXTERNAL TABLE customer_data (
  customer_id   INT,
  customer_name STRING,
  city          STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/customers';
This is an external table because the statement uses the EXTERNAL keyword. The LOCATION clause tells Hive where the existing data files are stored.
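To see that an external table's data outlives the table, you can drop the table and define it again over the same directory. This is a sketch that assumes comma-delimited files already exist under /user/hadoop/customers, as in the example above.

```sql
-- Removes only the metadata; the files under /user/hadoop/customers stay in place
DROP TABLE customer_data;

-- Re-create the table over the same directory
CREATE EXTERNAL TABLE customer_data (
  customer_id   INT,
  customer_name STRING,
  city          STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/customers';

-- The existing files are readable again without reloading anything
SELECT * FROM customer_data LIMIT 10;
```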