Hive Indexes
A Hive Index is a data structure that improves the speed of query execution by providing a quick way to find rows in a table without scanning the whole table. It works just like an index in a regular database system.
In simple words:
- Without an index: Hive looks at every row in the table.
- With an index: Hive can skip rows that don’t match, making things faster.
For example, if you have a huge table called employee_data and you often query by department_id, creating an index on department_id will make those queries faster.
Why Do We Use Indexes?
Using indexes in Hive makes sense when:
- We have large tables and need to speed up queries on specific columns.
- We often use the same column in WHERE clauses.
- We want to reduce the amount of data Hive scans.
How to Create and Use Indexes in Hive?
Hive gives you a simple way to create, use, and manage indexes. Let’s walk through the process -
CREATE INDEX -
The CREATE INDEX command is used to create an index on a table column that you often filter in your queries.
Syntax -
CREATE INDEX index_name
ON TABLE table_name (column_name)
AS 'path_name'
WITH DEFERRED REBUILD;
Examples -
CREATE INDEX idx_department
ON TABLE employee_data (department_id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
It creates an index named idx_department. The index is built on the column department_id of the employee_data table. The keyword WITH DEFERRED REBUILD means that the index structure is created but not filled with data yet.
ALTER INDEX -
After creating the index, you need to build (or populate) it using the ALTER INDEX command. Without this step, the index won't contain any useful data.
Syntax -
ALTER INDEX index_name ON table_name REBUILD;
Examples -
ALTER INDEX idx_department ON employee_data REBUILD;
It fills the index with the actual data from the table and makes the index ready for use by Hive when running queries.
When to use this?
We must always run ALTER INDEX REBUILD after creating the index. We may also use it later if the underlying table data changes and you need to refresh the index.
Creating an index is like building an empty library shelf. Running REBUILD is like putting the books (data) on the shelf.
DROP INDEX -
If you no longer need an index, you can remove it using the DROP INDEX command. This is helpful when:
- The index is not used anymore.
- We want to save storage space.
- The table structure has changed.
Syntax -
DROP INDEX index_name ON table_name;
Examples -
DROP INDEX idx_department ON employee_data;
It deletes the index named idx_department and removes the index metadata and any stored index data.
Why drop an index?
To clean up unused or outdated indexes that no longer help in query performance.
Real-Life Scenario Example (Step by Step)
Imagine you are managing a Hive table called sales_data with millions of rows. We often run this query:
SELECT * FROM sales_data WHERE region = 'North';
Step-1: Create an index:
CREATE INDEX idx_region ON TABLE sales_data (region)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
Step-2: Build the index:
ALTER INDEX idx_region ON sales_data REBUILD;
Step-3: Run your query (Hive can now use the index):
SELECT * FROM sales_data WHERE region = 'North';
Step-4: Drop the index if not needed:
DROP INDEX idx_region ON sales_data;