Hive Load Data to Table
In Hive, loading data means placing your data files into the directory that Hive uses to store the table's data. This is not the same as inserting one row at a time as in traditional databases. Hive is built for batch processing, so you load whole files.
For example, suppose you have a CSV file called employees.csv saved in HDFS or on your local machine. To make Hive recognize this file as table data, you load it into a Hive table.
LOAD DATA LOCAL INPATH '/home/user/employees.csv'
INTO TABLE employee_data;
This command tells Hive: "Take this file and place it in the Hive table's storage location so I can query it." Because LOCAL is specified here, the file is copied from the local file system; the original stays where it is.
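For this to work, the employee_data table must already exist. Below is a minimal sketch of what it might look like, assuming a comma-delimited file with three columns (the column names and types are illustrative):
CREATE TABLE IF NOT EXISTS employee_data (
  emp_id INT,
  emp_name STRING,
  emp_salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;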
Why Do We Need to Load Data?
- Hive tables are mostly used for big data analysis, so data usually comes from files.
- Loading data is the first step before you can run queries like SELECT, JOIN, or GROUP BY (see the quick check query after this list).
- We may receive new data daily, weekly, or monthly and need to load it into Hive regularly.
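Once a file has been loaded, the table can be queried immediately. A minimal check, assuming the employee_data table from the earlier example:
SELECT * FROM employee_data LIMIT 10;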
Syntax -
Load Data from Local File System: used when your file is stored on the local machine (such as your laptop or a server). Hive copies the file into the table's storage location.
LOAD DATA LOCAL INPATH 'local_file_path'
INTO TABLE table_name;
Load Data from HDFS (Hadoop Distributed File System): used when your file is already stored in HDFS; you can load it directly without the LOCAL keyword. In this case the file is moved, not copied, into the table's location.
LOAD DATA INPATH 'hdfs_file_path'
INTO TABLE table_name;
Load Data with OVERWRITE Option: used when you want to replace the existing data in the Hive table. Without OVERWRITE, the new file is added alongside the table's existing data; with OVERWRITE, the old data is removed first.
LOAD DATA INPATH 'hdfs_file_path'
OVERWRITE INTO TABLE table_name;
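To make the difference concrete, here is a small sketch (the file paths and table name are illustrative) showing that plain INTO TABLE appends while OVERWRITE INTO TABLE replaces:
-- Appends: rows from day1.csv are added to whatever sales_raw already holds
LOAD DATA INPATH '/user/hadoop/day1.csv' INTO TABLE sales_raw;

-- Replaces: existing data under sales_raw is removed, then day2.csv is loaded
LOAD DATA INPATH '/user/hadoop/day2.csv' OVERWRITE INTO TABLE sales_raw;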
Examples -
Scenario 1: Loading a file from the local machine:
LOAD DATA LOCAL INPATH '/home/user/employees.csv'
INTO TABLE employee_data;
Hive copies the file into its warehouse directory in HDFS. The file is now ready for querying.
Scenario 2: Loading a file from HDFS:
LOAD DATA INPATH '/user/hadoop/employees.csv'
INTO TABLE employee_data;
This is commonly used in big data environments where data files already reside in HDFS. Note that Hive moves the file from its original HDFS location into the table's directory.
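If you want to confirm where the file ended up, you can list the table's directory from the Hive CLI. A sketch assuming the default warehouse location /user/hive/warehouse (your cluster's path may differ):
dfs -ls /user/hive/warehouse/employee_data/;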
Scenario 3: Loading a file with OVERWRITE:
LOAD DATA INPATH '/user/hadoop/updated_employees.csv'
OVERWRITE INTO TABLE employee_data;
This deletes the existing data in the table and loads only the new file you specify.
Scenario 4: Loading data into a partitioned table:
LOAD DATA INPATH '/user/hadoop/sales_july.csv'
INTO TABLE sales_data PARTITION(month='July');
Loading into a specific partition keeps large datasets organized and lets queries that filter on the partition column read only the relevant data, making them faster.
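For this to work, sales_data must have been created with a PARTITIONED BY clause. A minimal sketch, with illustrative column names, followed by a query that touches only the July partition:
CREATE TABLE IF NOT EXISTS sales_data (
  sale_id INT,
  product STRING,
  amount DOUBLE
)
PARTITIONED BY (month STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Only the month='July' partition is scanned, not the whole table
SELECT product, SUM(amount) FROM sales_data WHERE month = 'July' GROUP BY product;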