Lab 28: Getting started with Apache Hive

Hi Hadoopers,

We had an interesting journey into Apache Pig. I’m happy to start another blog post series on Apache Hive today.

I’m running short of time as I rush to conclude my training schedule, so I don’t have time to draw and explain the architecture diagram. I may add it when I revisit this series. For now, let me walk through my first lab exercise.


As usual, the data lives in HDFS. Hive, however, presents it in a virtual tabular format using a metastore: a local or remote relational database that stores the mapping between tables and the underlying HDFS data. Using this mapping, Hive reads the HDFS files at query time. This post therefore shows the Hive commands and the corresponding metastore changes in parallel.
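This "schema on read" idea can be sketched in a few lines of Python. This is a toy illustration of the concept, not Hive code: the metastore supplies the column schema, and the raw file is only interpreted against it when it is read.

```python
# Toy illustration of schema-on-read: the schema (which Hive keeps in the
# metastore) is applied to the raw delimited file only at read time.
schema = [("eid", int), ("ename", str), ("eage", int), ("edesig", str)]

def read_table(path, schema, delimiter=","):
    """Parse a delimited text file into rows according to the schema."""
    rows = []
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split(delimiter)
            row = {}
            for (name, typ), raw in zip(schema, fields):
                try:
                    row[name] = typ(raw)
                except ValueError:
                    row[name] = None  # like Hive, fall back to NULL on a failed parse
            rows.append(row)
    return rows
```

The data file itself never changes; only the interpretation recorded in the schema does, which is why the same HDFS file can be exposed under different table definitions.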

Show Databases

The command to list the available databases is given below.

hive>  show databases;
OK
default
Time taken: 1.225 seconds, Fetched: 1 row(s)

The database information is stored in the DBS table of the metastore. Let’s look at the contents of DBS.

mysql> use metastore;
mysql> show tables;
+---------------------------+
| Tables_in_metastore       |
+---------------------------+
| BUCKETING_COLS            |
| CDS                       |
| COLUMNS_V2                |
| DATABASE_PARAMS           |
| DBS                       |
| DB_PRIVS                  |
| DELEGATION_TOKENS         |
| FUNCS                     |
| FUNC_RU                   |
| GLOBAL_PRIVS              |
| IDXS                      |
| INDEX_PARAMS              |
| MASTER_KEYS               |
| NUCLEUS_TABLES            |
| PARTITIONS                |
| PARTITION_EVENTS          |
| PARTITION_KEYS            |
| PARTITION_KEY_VALS        |
| PARTITION_PARAMS          |
| PART_COL_PRIVS            |
| PART_COL_STATS            |
| PART_PRIVS                |
| ROLES                     |
| ROLE_MAP                  |
| SDS                       |
| SD_PARAMS                 |
| SEQUENCE_TABLE            |
| SERDES                    |
| SERDE_PARAMS              |
| SKEWED_COL_NAMES          |
| SKEWED_COL_VALUE_LOC_MAP  |
| SKEWED_STRING_LIST        |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES             |
| SORT_COLS                 |
| TABLE_PARAMS              |
| TAB_COL_STATS             |
| TBLS                      |
| TBL_COL_PRIVS             |
| TBL_PRIVS                 |
| TYPES                     |
| TYPE_FIELDS               |
| VERSION                   |
+---------------------------+
43 rows in set (0.00 sec)
mysql> select * from DBS;
+-------+-----------------------+------------------------------------------+---------+------------+------------+
| DB_ID | DESC                  | DB_LOCATION_URI                          | NAME    | OWNER_TYPE | OWNER_NAME |
+-------+-----------------------+------------------------------------------+---------+------------+------------+
|     1 | Default Hive database | hdfs://gandhari:9000/user/hive/warehouse | default | ROLE       | public     |
+-------+-----------------------+------------------------------------------+---------+------------+------------+
1 row in set (0.00 sec)

Database Creation

Let’s create a database named mydatabase.

 hive> create database mydatabase;
 OK
 Time taken: 0.286 seconds

mysql> select * from DBS;
+-------+-----------------------+--------------------------------------------------------+------------+------------+------------+
| DB_ID | DESC                  | DB_LOCATION_URI                                        | NAME       | OWNER_TYPE | OWNER_NAME |
+-------+-----------------------+--------------------------------------------------------+------------+------------+------------+
|     1 | Default Hive database | hdfs://gandhari:9000/user/hive/warehouse               | default    | ROLE       | public     |
|     6 | NULL                  | hdfs://gandhari:9000/user/hive/warehouse/mydatabase.db | mydatabase | USER       | hadoop     |
+-------+-----------------------+--------------------------------------------------------+------------+------------+------------+
2 rows in set (0.00 sec)

The new database has been created with DB_ID 6.

Here is the change on the HDFS side. You will see a .db folder created in the warehouse folder of Hive.

hadoop@gandhari:/opt/hadoop-2.6.4$ hadoop fs -ls /user/hive/warehouse
Found 2 items
drwxrwxr-x   - hadoop supergroup          0 2016-08-24 05:58 /user/hive/warehouse/demo1
drwxrwxr-x   - hadoop supergroup          0 2016-10-15 12:30 /user/hive/warehouse/mydatabase.db

Table Creation

Let’s create a table now.

 hive> use mydatabase;
 OK
 Time taken: 0.016 seconds
 hive>

 hive> CREATE TABLE IF NOT EXISTS employee (eid INT, ename STRING, eage INT, edesig STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
 OK
 Time taken: 0.103 seconds

We have created the employee table. What has changed in the metastore?

mysql> select * from TBLS;
+--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+----------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER  | RETENTION | SD_ID | TBL_NAME | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | LINK_TARGET_ID |
+--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+----------------+
|      1 |  1471989527 |     1 |                0 | hadoop |         0 |     1 | demo1    | MANAGED_TABLE | NULL               | NULL               |           NULL |
|      7 |  1476505836 |     6 |                0 | hadoop |         0 |     7 | employee | MANAGED_TABLE | NULL               | NULL               |           NULL |
+--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+----------------+
2 rows in set (0.00 sec)

A new row has been added to the TBLS table for employee.

What about the change on HDFS? You will see another directory created inside the database folder.

hadoop@gandhari:/opt/hadoop-2.6.4$ hadoop fs -ls /user/hive/warehouse/mydatabase.db
Found 1 items
drwxrwxr-x   - hadoop supergroup          0 2016-10-15 12:30 /user/hive/warehouse/mydatabase.db/employee

Loading the content

We have a local CSV file for employees as given below.

hadoop@gandhari:/opt/hadoop-2.6.4/hivefarm$ cat employee.csv
1, Dharma, 45, Sr Manager
2, Bheema, 43, Cook
3, Arjuna, 41, Instructor
4, Nakula, 35, Jr Instructor
5, Sahadeva, 33, Jr Instructor
1, Dharma, 45, Sr Manager
2, Bheema, 43, Cook
3, Arjuna, 41, Instructor
4, Nakula, 35, Jr Instructor
5, Sahadeva, 33, Jr Instructor

To load the file from the local Unix file system into Hive, we use the statement given below.

 hive> LOAD DATA LOCAL INPATH '/opt/hadoop-2.6.4/hivefarm/employee.csv' OVERWRITE INTO TABLE employee;
 Loading data to table mydatabase.employee
 OK
 Time taken: 0.664 seconds

The CSV file has been uploaded to HDFS.

hadoop@gandhari:/opt/hadoop-2.6.4/hivefarm$ hadoop fs -ls /user/hive/warehouse/mydatabase.db/employee
Found 1 items
-rwxrwxr-x   3 hadoop supergroup        264 2016-10-15 12:51 /user/hive/warehouse/mydatabase.db/employee/employee.csv

Querying

Let’s start querying this.

 hive> select * from employee;
 OK
 1        Dharma NULL     Sr Manager
 2        Bheema NULL     Cook
 3        Arjuna NULL     Instructor
 4        Nakula NULL     Jr Instructor
 5        Sahadeva       NULL     Jr Instructor
 1        Dharma NULL     Sr Manager
 2        Bheema NULL     Cook
 3        Arjuna NULL     Instructor
 4        Nakula NULL     Jr Instructor
 5        Sahadeva       NULL     Jr Instructor
 Time taken: 2.612 seconds, Fetched: 10 row(s)

 hive> select * from employee limit 5;
 OK
 1        Dharma NULL     Sr Manager
 2        Bheema NULL     Cook
 3        Arjuna NULL     Instructor
 4        Nakula NULL     Jr Instructor
 5        Sahadeva       NULL     Jr Instructor
 Time taken: 0.136 seconds, Fetched: 5 row(s)
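The NULLs in the eage column deserve a note: the CSV has a space after every comma, and Hive does not trim whitespace when parsing numeric fields, so a value like " 45" fails to parse as INT and comes back NULL. One way to fix this is to strip the spaces before loading. A sketch assuming standard sed, with employee_sample.csv and employee_clean.csv as illustrative file names:

```shell
# The space after each comma is why eage shows NULL above: " 45" does not
# parse as INT. Strip the spaces, then reload the cleaned file.
# (Run here against a small sample copy; adjust paths to your data.)
printf '1, Dharma, 45, Sr Manager\n2, Bheema, 43, Cook\n' > employee_sample.csv
sed 's/, */,/g' employee_sample.csv > employee_clean.csv
cat employee_clean.csv
# 1,Dharma,45,Sr Manager
# 2,Bheema,43,Cook
```

After cleaning the real file, re-running the LOAD DATA LOCAL INPATH ... OVERWRITE statement against it should populate eage with proper integers.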

Querying with Map Reduce

Aggregation functions use Map Reduce to give you the results. Here is one example.

 hive> select count(*) from employee;
 Total jobs = 1
 Launching Job 1 out of 1
 Number of reduce tasks determined at compile time: 1
 In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=<number>
 In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=<number>
 In order to set a constant number of reducers:
   set mapreduce.job.reduces=<number>
 Job running in-process (local Hadoop)
 2016-10-15 12:55:39,909 Stage-1 map = 100%,  reduce = 100%
 Ended Job = job_local1840713229_0001
 MapReduce Jobs Launched:
 Stage-Stage-1:  HDFS Read: 1584 HDFS Write: 528 SUCCESS
 Total MapReduce CPU Time Spent: 0 msec
 OK
 10
 Time taken: 2.535 seconds, Fetched: 1 row(s)
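Conceptually, this count(*) runs as a single MapReduce job: each mapper emits one count per input row it sees, and a lone reducer sums the partial counts. A toy Python sketch of that flow, with no Hadoop involved:

```python
from functools import reduce

# Stand-ins for the rows seen by mappers over the input splits of
# employee.csv (10 rows in total, as in the table above).
splits = [
    ["1,Dharma", "2,Bheema", "3,Arjuna", "4,Nakula", "5,Sahadeva"],
    ["1,Dharma", "2,Bheema", "3,Arjuna", "4,Nakula", "5,Sahadeva"],
]

# Map phase: each mapper emits a partial count of its rows.
partials = [sum(1 for _ in split) for split in splits]

# Reduce phase: the single reducer adds up the partial counts.
total = reduce(lambda a, b: a + b, partials, 0)
print(total)  # 10, matching the query result above
```

This is also why the query output mentions "Number of reduce tasks determined at compile time: 1": a global count needs exactly one reducer to combine the partial sums.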

Store the output

The following statement would save the output to local Unix FS.

 hive> INSERT OVERWRITE LOCAL DIRECTORY '/opt/hadoop/hivefarm/mr-result.csv' SELECT * FROM employee;
 WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
 Query ID = hadoop_20161015125721_f2ac0333-f444-47af-b3c7-57a079d6ca2a
 Total jobs = 1
 Launching Job 1 out of 1
 Number of reduce tasks is set to 0 since there's no reduce operator
 Job running in-process (local Hadoop)
 2016-10-15 12:57:23,074 Stage-1 map = 100%,  reduce = 0%
 Ended Job = job_local783745060_0002
 Moving data to local directory /opt/hadoop/hivefarm/mr-result.csv
 MapReduce Jobs Launched:
 Stage-Stage-1:  HDFS Read: 1056 HDFS Write: 264 SUCCESS
 Total MapReduce CPU Time Spent: 0 msec
 OK
 Time taken: 1.44 seconds

Let’s verify this output.

hadoop@gandhari:/opt/hadoop-2.6.4/hivefarm$ cat mr-result.csv/000000_0
1 Dharma\N Sr Manager
2 Bheema\N Cook
3 Arjuna\N Instructor
4 Nakula\N Jr Instructor
5 Sahadeva\N Jr Instructor
1 Dharma\N Sr Manager
2 Bheema\N Cook
3 Arjuna\N Instructor
4 Nakula\N Jr Instructor
5 Sahadeva\N Jr Instructor
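The output looks odd for two reasons: Hive writes NULL as the literal \N, and since we gave no ROW FORMAT for the export, fields are separated by Hive’s default delimiter, Ctrl-A (\x01), which the terminal renders as a blank. A toy reader for such a file (an illustration, not a Hive API) might look like this:

```python
def read_hive_export(path, delimiter="\x01", null_marker="\\N"):
    """Read a Hive text export, mapping the \\N marker back to None."""
    rows = []
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split(delimiter)
            rows.append([None if v == null_marker else v for v in fields])
    return rows
```

Specifying ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' on the INSERT OVERWRITE would produce a friendlier comma-separated export instead.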

Now let’s store the output to HDFS itself.

hive> INSERT OVERWRITE DIRECTORY '/user/hadoop/lab27' SELECT * FROM employee;

hadoop@gandhari:/opt/hadoop-2.6.4/hivefarm$ hadoop fs -ls /user/hadoop/lab27
Found 1 items
-rwxr-xr-x   3 hadoop supergroup        254 2016-10-15 13:05 /user/hadoop/lab27/000000_0

hadoop@gandhari:/opt/hadoop-2.6.4/hivefarm$ hadoop fs -cat /user/hadoop/lab27/000000_0
1 Dharma\N Sr Manager
2 Bheema\N Cook
3 Arjuna\N Instructor
4 Nakula\N Jr Instructor
5 Sahadeva\N Jr Instructor
1 Dharma\N Sr Manager
2 Bheema\N Cook
3 Arjuna\N Instructor
4 Nakula\N Jr Instructor
5 Sahadeva\N Jr Instructor

Drop

Finally, we can drop a table or database just as in any other SQL dialect. Note that employee is a managed table, so dropping it also deletes its data directory under the Hive warehouse on HDFS.

hive> drop table employee;
OK
Time taken: 2.405 seconds