Lab 29: External tables in Hive

Hi Hadoopers,

My previous post demonstrated how to create the tables in hive and how to retrieve the content from the same. All of those tables are ‘internal’ tables, ie., those are hive managed tables. If you drop any table or database, you will see those get deleted from the warehouse folder.

But we used to work a lot with external tables. When you drop an external table, the metastore information (only) will get deleted. The content will still be available in HDFS.

Hence this post will show you how to create external tables.

pighive3

Table creation

I have a CSV file in /user/hadoop/lab28/input. I’ll use it to create the external table.

$ hadoop fs -cat /user/hadoop/lab28/input/employee.csv
1, Dharma, 45, Sr Manager
2, Bheema, 43, Cook
3, Arjuna, 41, Instructor
4, Nakula, 35, Jr Instructor
5, Sahadeva, 33, Jr Instructor

Let’s create the table employee_ext as given below. The structure is same as that of the one I used in my earlier blog post, except the location.

hive> CREATE external TABLE IF NOT EXISTS employee_ext (eid INT, ename STRING, eage INT, edesig STRING) ROW FORMAT DELIMITED     FIELDS TERMINATED BY ','     STORED AS TEXTFILE     LOCATION '/user/hadoop/lab28/input';
OK
Time taken: 0.086 seconds

Let’s verify if the table is created.

hive> show tables;
OK
demo1
employee_ext
Time taken: 0.08 seconds, Fetched: 2 row(s)

Let’s check the meta store now.

mysql> select * from TBLS;
+--------+-------------+-------+------------------+--------+-----------+-------+--------------+----------------+--------------------+--------------------+----------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER  | RETENTION | SD_ID | TBL_NAME     | TBL_TYPE       | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | LINK_TARGET_ID |
+--------+-------------+-------+------------------+--------+-----------+-------+--------------+----------------+--------------------+--------------------+----------------+
|      1 |  1471989527 |     1 |                0 | hadoop |         0 |     1 | demo1        | MANAGED_TABLE  | NULL               | NULL               |           NULL |
|      7 |  1476505836 |     6 |                0 | hadoop |         0 |     7 | employee     | MANAGED_TABLE  | NULL               | NULL               |           NULL |
|     11 |  1476519584 |     1 |                0 | hadoop |         0 |    11 | employee_ext | EXTERNAL_TABLE | NULL               | NULL               |           NULL |
+--------+-------------+-------+------------------+--------+-----------+-------+--------------+----------------+--------------------+--------------------+----------------+
3 rows in set (0.00 sec)

The table employee_ext is registered in meta store. Let’s start querying it now.

Querying

Show all records from employee_ext

hive> select * from employee_ext;
OK
1        Dharma NULL     Sr Manager
2        Bheema NULL     Cook
3        Arjuna NULL     Instructor
4        Nakula NULL     Jr Instructor
5        Sahadeva       NULL     Jr Instructor
Time taken: 0.123 seconds, Fetched: 5 row(s)

Describing the table.

hive> describe employee_ext;
OK
eid                     int
ename                   string
eage                    int
edesig                  string
Time taken: 0.044 seconds, Fetched: 4 row(s)

hive> describe formatted employee_ext;
OK
# col_name              data_type               comment

eid                     int
ename                   string
eage                    int
edesig                  string

# Detailed Table Information
Database:               default
Owner:                  hadoop
CreateTime:             Sat Oct 15 16:26:18 MYT 2016
LastAccessTime:         UNKNOWN
Retention:              0
Location:               hdfs://gandhari:9000/user/hadoop/lab28/input
Table Type:             EXTERNAL_TABLE
Table Parameters:
EXTERNAL                TRUE
transient_lastDdlTime   1476519978

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
field.delim             ,
serialization.format    ,
Time taken: 0.053 seconds, Fetched: 30 row(s)

Let’s update one table using another query result.Let’s create a new table first. This is a hive managed table.

hive> CREATE TABLE IF NOT EXISTS employee_new (eid INT, ename STRING, eage INT, edesig STRING) ROW FORMAT DELIMITED     FIELDS TERMINATED BY ','     STORED AS TEXTFILE;
OK
Time taken: 0.101 seconds

Let’s update the new table now.

hive> INSERT OVERWRITE TABLE employee_new SELECT * FROM employee_ext limit 2;
hive> select * from employee_new;
OK
2        Bheema NULL     Cook
1        Dharma NULL     Sr Manager
Time taken: 0.078 seconds, Fetched: 2 row(s)

It is updated as expected. Let’s see what happened in HDFS. The WareHouse folder contains a MR result.

hadoop@gandhari:/opt/hadoop-2.6.4/hivefarm$ hadoop fs -ls /user/hive/warehouse/employee_new;
Found 1 items
-rwxrwxr-x   3 hadoop supergroup         44 2016-10-15 16:38 /user/hive/warehouse/employee_new/000000_0

Let’s try to cat the file.

hadoop@gandhari:/opt/hadoop-2.6.4/hivefarm$ hadoop fs -cat /user/hive/warehouse/employee_new/000000_0
2, Bheema,\N, Cook
1, Dharma,\N, Sr Manager
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s