Summary -
In this topic, we described about the Create Table with detailed example.
As discussed in the previous chapter, Table is nothing but the set of rows and columns that are logically related. In another words, table is nothing logical structure of storing data. CREATE TABLE statement used to create the table. The syntax for various ways of creating table are below.
Explicit column definitions:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS]
[db_name.]table_name
[(col_name data_type [COMMENT 'col_comment'], ...)]
[COMMENT 'table_comment']
[PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
[WITH SERDEPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
[
[ROW FORMAT row_format] [STORED AS file_format]
]
[LOCATION 'hdfs_path']
[TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
[CACHED IN 'pool_name']
Column definitions inferred from data file:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS]
[db_name.]table_name
LIKE PARQUET 'hdfs_path_of_parquet_file'
[COMMENT 'table_comment']
[PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
[WITH SERDEPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
[
[ROW FORMAT row_format] [STORED AS file_format]
]
[LOCATION 'hdfs_path']
[TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
[CACHED IN 'pool_name']
data_type
: primitive_type
CREATE TABLE AS SELECT:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS]
db_name.]table_name
[COMMENT 'table_comment']
[WITH SERDEPROPERTIES ('key1'='value1',
'key2'='value2', ...)]
[
[ROW FORMAT row_format] [STORED AS file_format]
]
[LOCATION 'hdfs_path']
[TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
[CACHED IN 'pool_name']
AS
select_statement
Lets discuss about each option in detail.
CACHED IN clause -
The CACHED IN clause used to be loaded any existing or future data files in the table directory or the partition subdirectories into memory with the HDFS caching mechanism.
CREATE TABLE AS SELECT -
The CREATE TABLE AS SELECT syntax to create a table based on column definitions from another table and copy data from the source table to the destination table. It’s not requiring to issue separate INSERT statement. Any column or table comments from the original table are not carried over to the new table.
CREATE TABLE LIKE PARQUET -
The CREATE TABLE ... LIKE PARQUET 'hdfs_path_of_parquet_file' allows skipping the column definitions of the CREATE TABLE statement. The column names and data types are automatically configured by using organization of the specified Parquet data file reside in HDFS.
EXTERNAL and LOCATION -
The EXTERNAL and LOCATION clauses are often specified together. The EXTERNAL and LOCATION clauses are optional clauses. EXTERNAL clause used when user tries to create external table.
If user tries to create external table, the external table storing LOCATION needs to be specified. LOCATION is optional for external tables, and can also specify LOCATION for internal tables. An external table requires a LOCATION clause to specify the path to the HDFS directory to reads and writes files for the table.
LIKE clause -
To create an empty table with the same columns, comments, and other attributes as another existing table. The CREATE TABLE ... LIKE form allows a restricted set of clauses like LOCATION, COMMENT, and STORED AS clauses.
PARTITIONED BY clause -
The PARTITIONED BY clause divides the data files from one or more specified columns based on the values. Shell queries can use the partition metadata to minimize the amount of data that is read from disk, particularly during join queries.
STORED AS and ROW FORMAT clauses -
The STORED AS clause identifies the format of the original data files. By default no STORED AS clause is specified. The data files in tables are created as text files with Ctrl-A (hex 01) characters as the delimiter.
The ROW FORMAT DELIMITED clause to produce or consume data files that use a different delimiter character such as tab or |, or a different line end character.
The delimiter and line end characters with the FIELDS TERMINATED BY and LINES TERMINATED BY clauses are '\t' for tab, '\n' for carriage return, '\r' for linefeed, and \0 for ASCII nul (hex 00).
TBLPROPERTIES and WITH SERDEPROPERTIES clauses -
TBLPROPERTIES clause used to associate subjective items of metadata with a table. TBLPROPERTIES clause takes a comma-separated list of key-value pairs. TBLPROPERTIES clause also stores key-value pair items in the metastore database. The table properties can be changed with an ALTER TABLE statement.
WITH SERDEPROPERTIES clause associate SerDes properties with the table by specifying key-value pairs. Some DDL operations interact with other Hadoop components require specifying particular values in the SERDEPROPERTIES or TBLPROPERTIES fields.
Example -
Assume we have a requirement to create student information with the columns student number, name, class, grade for a school.
Std No | Std Name | Std Class | Std Grade |
---|---|---|---|
1 | Pawan | 10 | A |
2 | Srinivas | 10 | A |
3 | Sridhar | 10 | A |
4 | Kumar | 10 | B |
Internal table -
CREATE TABLE std_details_i (std_no INT,
std_name STRING, std_class INT, std_grade STRING)
COMMENT ‘Student Information’
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’
LINES TERMINATED BY ‘\n’
STORED AS TEXTFILE;
After successful execution of the above statement, std_details_i table created.
External table -
CREATE EXTERNAL TABLE std_details_e (std_no INT,
std_name STRING, std_class INT, std_grade STRING)
COMMENT ‘Student Information’
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’
LINES TERMINATED BY ‘\n’
STORED AS PARQUET LOCATION '/usr/hcatalog/std_db;
After successful execution of the above statement, std_details_e table created.