Lab 30: Hive Queries

Hi Hadoopers,

I have a program that extracts RSS feeds from different sources into a tab-delimited text file. I used Hive to do some mining on it today. Let's see the results.


The file has 12 fields separated by tabs. Here is the table description.


CREATE EXTERNAL TABLE IF NOT EXISTS feed_article (
  feedgenerator STRING, feedtitle STRING, feed_author STRING, feed_url STRING,
  feed_time STRING, item_subject STRING, item_author STRING, itemurl STRING,
  itemdate STRING, category STRING, DescriptionFile STRING, uniqueId BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/hadoop/lab27';

hive> describe feed_article;
OK
feedgenerator           string
feedtitle               string
feed_author             string
feed_url                string
feed_time               string
item_subject            string
item_author             string
itemurl                 string
itemdate                string
category                string
descriptionfile         string
uniqueid                bigint
Time taken: 0.062 seconds, Fetched: 12 row(s)

Count how many articles were published today.


hive> select count(*) from feed_article;
OK
3699
Time taken: 1.624 seconds, Fetched: 1 row(s)
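All rows in this extract are from today, so a plain count is enough. If the file accumulated several days of feeds, we would need a date filter; here is a hedged sketch, assuming (unverified) that itemdate begins with a yyyy-MM-dd date, with the date of interest substituted:

hive> select count(*) from feed_article where itemdate like '2016-10-16%';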

Here is the list of distinct authors for today.


hive> select distinct item_author from feed_article;
,Alok Deshpande
-தி.இன்பராஜ்-
-பா.ராஜா
A.D.Balasubramaniyan
A.T.S Pandian
AFP
AP
Aekaanthan
Aishwarya Parikh
Akanksha Jain
Alex Barile
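Notice that some entries, such as ",Alok Deshpande", carry leading punctuation from the source feeds. A hedged cleanup sketch (the regex is my assumption, untested against the full dataset) that strips leading commas, hyphens and spaces:

hive> select distinct regexp_replace(item_author, '^[,\\s-]+', '') from feed_article;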

Let's see which site published a lot of articles.


hive> select feedtitle, count(*) from feed_article group by feedtitle;
NULL    139
A Wandering Mind        1
APMdigest Hot Topics: APM       2
Application Performance Monitoring Blog | AppDynamics   1
BSNLTeleServices | BSNL Broadband Plans, Bill Payment Selfcare Portal   3
Bangalore Aviation      1
Blog Feed       1
Cloudera Engineering Blog       1
DailyThanthi.com        20

Who wrote the most articles today?

hive> select item_author, count (*) from feed_article group by item_author order by item_author desc limit 5;
OK
ஹாவேரி, 1
ஹரி கிருஷ்ணன்     14
ஹரன் பிரசன்னா     2
ஸ்கிரீனன்  4
ஷங்கர்    2
Time taken: 2.476 seconds, Fetched: 5 row(s)
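Note that this query orders by item_author (descending), so it really lists the last five authors in sort order rather than the most prolific ones. A hedged variant that ranks by article count instead:

hive> select item_author, count(*) as articles from feed_article group by item_author order by articles desc limit 5;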

Authors of which website wrote the most articles today?

hive> select item_author, feedtitle, count (*) from feed_article group by item_author, feedtitle order by item_author desc limit 10;
ஹாவேரி, Dinamani - பெங்களூரு - http://www.dinamani.com/all-editions/edition-bangalore/ 1
ஹரி கிருஷ்ணன்     Dinamani - தினந்தோறும் திருப்புகழ் - http://www.dinamani.com/specials/dinanthorum-thirupugal/     14
ஹரன் பிரசன்னா     ஹரன் பிரசன்னா     2
ஸ்கிரீனன்  தி இந்து - முகப்பு        1
ஸ்கிரீனன்  தி இந்து - தமிழ் சினிமா   1
ஸ்கிரீனன்  தி இந்து - சினிமா        2
ஷங்கர்    தி இந்து - சினிமா        1
ஷங்கர்    தி இந்து - முகப்பு        1
வெங்கடேசன். ஆர்    Dinamani - வேலைவாய்ப்பு - http://www.dinamani.com/employment/  32
வெங்கடேசன். ஆர்    Dinamani - விவசாயம் - http://www.dinamani.com/agriculture/    2
Time taken: 2.493 seconds, Fetched: 10 row(s)

Which feed software was used to publish the articles?


hive> select feedgenerator, count (*) from feed_article group by feedgenerator order by feedgenerator desc limit 10;
https://wordpress.org/?v=4.6.1  5
https://wordpress.org/?v=4.5.4  80
https://wordpress.org/?v=4.5.2  1
http://wordpress.org/?v=4.2.10  2
http://wordpress.org/?v=4.1.4   7
http://wordpress.org/?v=3.5.1   10
http://wordpress.org/?v=3.0     1
http://wordpress.com/   13
application/rss+xml     3434
Jive Engage 8.0.2.0  (http://jivesoftware.com/products/)        1
Time taken: 2.473 seconds, Fetched: 10 row(s)
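The WordPress feeds report a different generator string per version. A hedged sketch that collapses them into a single bucket (Hive wants the expression repeated in the GROUP BY):

hive> select if(feedgenerator like '%wordpress%', 'WordPress', feedgenerator) as generator, count(*) from feed_article group by if(feedgenerator like '%wordpress%', 'WordPress', feedgenerator);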


Lab 29: External tables in Hive

Hi Hadoopers,

My previous post demonstrated how to create tables in Hive and how to retrieve content from them. All of those tables are 'internal' tables, i.e., Hive-managed tables. If you drop such a table or database, it gets deleted from the warehouse folder.

In practice, though, we work a lot with external tables. When you drop an external table, only the metastore information gets deleted; the content is still available in HDFS.

Hence this post will show you how to create external tables.
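To make the contrast concrete, here is a minimal sketch (not actually run in this lab) of what dropping the external table we create below would do:

hive> drop table employee_ext;
-- Only the metastore entry is removed.
hive> dfs -ls /user/hadoop/lab28/input;
-- The CSV file is still listed. Dropping a managed table would
-- also have deleted its files under /user/hive/warehouse.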


Table creation

I have a CSV file in /user/hadoop/lab28/input. I’ll use it to create the external table.

$ hadoop fs -cat /user/hadoop/lab28/input/employee.csv
1, Dharma, 45, Sr Manager
2, Bheema, 43, Cook
3, Arjuna, 41, Instructor
4, Nakula, 35, Jr Instructor
5, Sahadeva, 33, Jr Instructor

Let's create the table employee_ext as given below. The structure is the same as the one I used in my earlier blog post, except for the location.

hive> CREATE EXTERNAL TABLE IF NOT EXISTS employee_ext (eid INT, ename STRING, eage INT, edesig STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/user/hadoop/lab28/input';
OK
Time taken: 0.086 seconds

Let’s verify if the table is created.

hive> show tables;
OK
demo1
employee_ext
Time taken: 0.08 seconds, Fetched: 2 row(s)

Let's check the metastore now.

mysql> select * from TBLS;
+--------+-------------+-------+------------------+--------+-----------+-------+--------------+----------------+--------------------+--------------------+----------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER  | RETENTION | SD_ID | TBL_NAME     | TBL_TYPE       | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | LINK_TARGET_ID |
+--------+-------------+-------+------------------+--------+-----------+-------+--------------+----------------+--------------------+--------------------+----------------+
|      1 |  1471989527 |     1 |                0 | hadoop |         0 |     1 | demo1        | MANAGED_TABLE  | NULL               | NULL               |           NULL |
|      7 |  1476505836 |     6 |                0 | hadoop |         0 |     7 | employee     | MANAGED_TABLE  | NULL               | NULL               |           NULL |
|     11 |  1476519584 |     1 |                0 | hadoop |         0 |    11 | employee_ext | EXTERNAL_TABLE | NULL               | NULL               |           NULL |
+--------+-------------+-------+------------------+--------+-----------+-------+--------------+----------------+--------------------+--------------------+----------------+
3 rows in set (0.00 sec)

The table employee_ext is registered in the metastore. Let's start querying it now.

Querying

Show all records from employee_ext

hive> select * from employee_ext;
OK
1        Dharma NULL     Sr Manager
2        Bheema NULL     Cook
3        Arjuna NULL     Instructor
4        Nakula NULL     Jr Instructor
5        Sahadeva       NULL     Jr Instructor
Time taken: 0.123 seconds, Fetched: 5 row(s)

Notice that eage is NULL for every row: the CSV has a space after each comma, so values like ' 45' fail to parse as INT. Now let's describe the table.

hive> describe employee_ext;
OK
eid                     int
ename                   string
eage                    int
edesig                  string
Time taken: 0.044 seconds, Fetched: 4 row(s)

hive> describe formatted employee_ext;
OK
# col_name              data_type               comment

eid                     int
ename                   string
eage                    int
edesig                  string

# Detailed Table Information
Database:               default
Owner:                  hadoop
CreateTime:             Sat Oct 15 16:26:18 MYT 2016
LastAccessTime:         UNKNOWN
Retention:              0
Location:               hdfs://gandhari:9000/user/hadoop/lab28/input
Table Type:             EXTERNAL_TABLE
Table Parameters:
EXTERNAL                TRUE
transient_lastDdlTime   1476519978

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
field.delim             ,
serialization.format    ,
Time taken: 0.053 seconds, Fetched: 30 row(s)

Let's populate one table with the result of a query on another. First we create a new table; this one is a Hive-managed table.

hive> CREATE TABLE IF NOT EXISTS employee_new (eid INT, ename STRING, eage INT, edesig STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 0.101 seconds

Let’s update the new table now.

hive> INSERT OVERWRITE TABLE employee_new SELECT * FROM employee_ext limit 2;
hive> select * from employee_new;
OK
2        Bheema NULL     Cook
1        Dharma NULL     Sr Manager
Time taken: 0.078 seconds, Fetched: 2 row(s)

It is updated as expected (the two rows come back in a different order, since LIMIT without ORDER BY guarantees none). Let's see what happened in HDFS: the warehouse folder contains a MapReduce result file.

hadoop@gandhari:/opt/hadoop-2.6.4/hivefarm$ hadoop fs -ls /user/hive/warehouse/employee_new;
Found 1 items
-rwxrwxr-x   3 hadoop supergroup         44 2016-10-15 16:38 /user/hive/warehouse/employee_new/000000_0

Let’s try to cat the file.

hadoop@gandhari:/opt/hadoop-2.6.4/hivefarm$ hadoop fs -cat /user/hive/warehouse/employee_new/000000_0
2, Bheema,\N, Cook
1, Dharma,\N, Sr Manager
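As noted earlier, eage is NULL because padded values like ' 45' fail to parse as INT, and \N is Hive's default text encoding of NULL. A minimal repair sketch (the table names employee_raw and employee_clean are hypothetical; the LOCATION is the one from this lab): map the same files with all-STRING columns so nothing is lost at parse time, then trim and cast into a clean managed table.

-- Re-read the raw text without any INT parsing.
CREATE EXTERNAL TABLE IF NOT EXISTS employee_raw (eid STRING, ename STRING, eage STRING, edesig STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/user/hadoop/lab28/input';

CREATE TABLE IF NOT EXISTS employee_clean (eid INT, ename STRING, eage INT, edesig STRING);

-- trim() removes the padding, so the cast succeeds.
INSERT OVERWRITE TABLE employee_clean SELECT cast(trim(eid) as INT), trim(ename), cast(trim(eage) as INT), trim(edesig) FROM employee_raw;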

Lab 28: Getting started with Apache Hive

Hi Hadoopers,

We had an interesting journey into Apache Pig. I’m happy to start another blog post series on Apache Hive today.

I'm running short of time as I'm rushing to conclude my training schedule, so I don't have much time to draw the architecture diagram and explain it; I may do that when I revisit this topic. Let me write about my first lab exercise in this post.


As usual, the data lives in HDFS, but Hive presents it in a virtual tabular format using a metastore database: it maintains a local or remote database that stores the mapping between tables and data, and uses that mapping to read the HDFS data. Hence this blog post will show you the commands and the metastore side by side.

Show Databases

The command to see the list of databases is given below.

hive>  show databases;
OK
default
Time taken: 1.225 seconds, Fetched: 1 row(s)

The database information is stored in the DBS table in the metastore. Let's see the content of DBS.

mysql> use metastore;
mysql> show tables;
+---------------------------+
| Tables_in_metastore       |
+---------------------------+
| BUCKETING_COLS            |
| CDS                       |
| COLUMNS_V2                |
| DATABASE_PARAMS           |
| DBS                       |
| DB_PRIVS                  |
| DELEGATION_TOKENS         |
| FUNCS                     |
| FUNC_RU                   |
| GLOBAL_PRIVS              |
| IDXS                      |
| INDEX_PARAMS              |
| MASTER_KEYS               |
| NUCLEUS_TABLES            |
| PARTITIONS                |
| PARTITION_EVENTS          |
| PARTITION_KEYS            |
| PARTITION_KEY_VALS        |
| PARTITION_PARAMS          |
| PART_COL_PRIVS            |
| PART_COL_STATS            |
| PART_PRIVS                |
| ROLES                     |
| ROLE_MAP                  |
| SDS                       |
| SD_PARAMS                 |
| SEQUENCE_TABLE            |
| SERDES                    |
| SERDE_PARAMS              |
| SKEWED_COL_NAMES          |
| SKEWED_COL_VALUE_LOC_MAP  |
| SKEWED_STRING_LIST        |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES             |
| SORT_COLS                 |
| TABLE_PARAMS              |
| TAB_COL_STATS             |
| TBLS                      |
| TBL_COL_PRIVS             |
| TBL_PRIVS                 |
| TYPES                     |
| TYPE_FIELDS               |
| VERSION                   |
+---------------------------+
43 rows in set (0.00 sec)
mysql> select * from DBS;
+-------+-----------------------+------------------------------------------+---------+------------+------------+
| DB_ID | DESC                  | DB_LOCATION_URI                          | NAME    | OWNER_TYPE | OWNER_NAME |
+-------+-----------------------+------------------------------------------+---------+------------+------------+
|     1 | Default Hive database | hdfs://gandhari:9000/user/hive/warehouse | default | ROLE       | public     |
+-------+-----------------------+------------------------------------------+---------+------------+------------+
1 row in set (0.00 sec)

Database Creation

Let's create a database named mydatabase.

 hive> create database mydatabase;
 OK
 Time taken: 0.286 seconds

mysql> select * from DBS;
+-------+-----------------------+--------------------------------------------------------+------------+------------+------------+
| DB_ID | DESC                  | DB_LOCATION_URI                                        | NAME       | OWNER_TYPE | OWNER_NAME |
+-------+-----------------------+--------------------------------------------------------+------------+------------+------------+
|     1 | Default Hive database | hdfs://gandhari:9000/user/hive/warehouse               | default    | ROLE       | public     |
|     6 | NULL                  | hdfs://gandhari:9000/user/hive/warehouse/mydatabase.db | mydatabase | USER       | hadoop     |
+-------+-----------------------+--------------------------------------------------------+------------+------------+------------+
2 rows in set (0.00 sec)

The new database is created with DB_ID 6.

Here is the change on the HDFS side. You will see a .db folder created in the warehouse folder of Hive.

hadoop@gandhari:/opt/hadoop-2.6.4$ hadoop fs -ls /user/hive/warehouse
Found 2 items
drwxrwxr-x   - hadoop supergroup          0 2016-08-24 05:58 /user/hive/warehouse/demo1
drwxrwxr-x   - hadoop supergroup          0 2016-10-15 12:30 /user/hive/warehouse/mydatabase.db

Table Creation

Let’s create a table now.

 hive> use mydatabase;
 OK
 Time taken: 0.016 seconds
 hive>

 hive> CREATE TABLE IF NOT EXISTS employee (eid INT, ename STRING, eage INT, edesig STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
 OK
 Time taken: 0.103 seconds

We created the employee table. What changed in the metastore now?

mysql> select * from TBLS;
+--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+----------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER  | RETENTION | SD_ID | TBL_NAME | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | LINK_TARGET_ID |
+--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+----------------+
|      1 |  1471989527 |     1 |                0 | hadoop |         0 |     1 | demo1    | MANAGED_TABLE | NULL               | NULL               |           NULL |
|      7 |  1476505836 |     6 |                0 | hadoop |         0 |     7 | employee | MANAGED_TABLE | NULL               | NULL               |           NULL |
+--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+----------------+
2 rows in set (0.00 sec)

A new row is added to the TBLS table for employee.

What about the change on HDFS? You can see another directory created in the warehouse folder.

hadoop@gandhari:/opt/hadoop-2.6.4$ hadoop fs -ls /user/hive/warehouse/mydatabase.db
Found 1 items
drwxrwxr-x   - hadoop supergroup          0 2016-10-15 12:30 /user/hive/warehouse/mydatabase.db/employee

Loading the content

We have a local CSV file for employees, given below. Note that each employee appears twice, so the file has ten rows in total.

hadoop@gandhari:/opt/hadoop-2.6.4/hivefarm$ cat employee.csv
1, Dharma, 45, Sr Manager
2, Bheema, 43, Cook
3, Arjuna, 41, Instructor
4, Nakula, 35, Jr Instructor
5, Sahadeva, 33, Jr Instructor
1, Dharma, 45, Sr Manager
2, Bheema, 43, Cook
3, Arjuna, 41, Instructor
4, Nakula, 35, Jr Instructor
5, Sahadeva, 33, Jr Instructor

To load the file from the Unix file system into Hive, we use the statement given below.

 hive> LOAD DATA LOCAL INPATH '/opt/hadoop-2.6.4/hivefarm/employee.csv' OVERWRITE INTO TABLE employee;
 Loading data to table mydatabase.employee
 OK
 Time taken: 0.664 seconds

The CSV file is uploaded to HDFS. (With LOCAL, Hive copies the file from the local file system; OVERWRITE first clears any existing files in the table's directory.)

hadoop@gandhari:/opt/hadoop-2.6.4/hivefarm$ hadoop fs -ls /user/hive/warehouse/mydatabase.db/employee
Found 1 items
-rwxrwxr-x   3 hadoop supergroup        264 2016-10-15 12:51 /user/hive/warehouse/mydatabase.db/employee/employee.csv

Querying

Let’s start querying this.

 hive> select * from employee;
 OK
 1        Dharma NULL     Sr Manager
 2        Bheema NULL     Cook
 3        Arjuna NULL     Instructor
 4        Nakula NULL     Jr Instructor
 5        Sahadeva       NULL     Jr Instructor
 1        Dharma NULL     Sr Manager
 2        Bheema NULL     Cook
 3        Arjuna NULL     Instructor
 4        Nakula NULL     Jr Instructor
 5        Sahadeva       NULL     Jr Instructor
 Time taken: 2.612 seconds, Fetched: 10 row(s)

 hive> select * from employee limit 5;
 OK
 1        Dharma NULL     Sr Manager
 2        Bheema NULL     Cook
 3        Arjuna NULL     Instructor
 4        Nakula NULL     Jr Instructor
 5        Sahadeva       NULL     Jr Instructor
 Time taken: 0.136 seconds, Fetched: 5 row(s)

Querying with MapReduce

Aggregation functions use MapReduce to produce their results. Here is one example.

 hive> select count(*) from employee;
 Total jobs = 1
 Launching Job 1 out of 1
 Number of reduce tasks determined at compile time: 1
 In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=<number>
 In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=<number>
 In order to set a constant number of reducers:
   set mapreduce.job.reduces=<number>
 Job running in-process (local Hadoop)
 2016-10-15 12:55:39,909 Stage-1 map = 100%,  reduce = 100%
 Ended Job = job_local1840713229_0001
 MapReduce Jobs Launched:
 Stage-Stage-1:  HDFS Read: 1584 HDFS Write: 528 SUCCESS
 Total MapReduce CPU Time Spent: 0 msec
 OK
 10
 Time taken: 2.535 seconds, Fetched: 1 row(s)

Store the output

The following statement saves the output to the local Unix file system. Despite the .csv name, the path is a directory; Hive writes numbered result files inside it.

 hive> INSERT OVERWRITE LOCAL DIRECTORY '/opt/hadoop/hivefarm/mr-result.csv' SELECT * FROM employee;
 WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
 Query ID = hadoop_20161015125721_f2ac0333-f444-47af-b3c7-57a079d6ca2a
 Total jobs = 1
 Launching Job 1 out of 1
 Number of reduce tasks is set to 0 since there's no reduce operator
 Job running in-process (local Hadoop)
 2016-10-15 12:57:23,074 Stage-1 map = 100%,  reduce = 0%
 Ended Job = job_local783745060_0002
 Moving data to local directory /opt/hadoop/hivefarm/mr-result.csv
 MapReduce Jobs Launched:
 Stage-Stage-1:  HDFS Read: 1056 HDFS Write: 264 SUCCESS
 Total MapReduce CPU Time Spent: 0 msec
 OK
 Time taken: 1.44 seconds

Let’s verify this output.

hadoop@gandhari:/opt/hadoop-2.6.4/hivefarm$ cat mr-result.csv/000000_0
1 Dharma\N Sr Manager
2 Bheema\N Cook
3 Arjuna\N Instructor
4 Nakula\N Jr Instructor
5 Sahadeva\N Jr Instructor
1 Dharma\N Sr Manager
2 Bheema\N Cook
3 Arjuna\N Instructor
4 Nakula\N Jr Instructor
5 Sahadeva\N Jr Instructor
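Here the columns are separated by Hive's default ^A delimiter (it may not render visibly), and \N again encodes NULL. Since Hive 0.11 the directory insert accepts an explicit row format; a hedged sketch (the output path is hypothetical) that writes comma-separated output instead:

hive> INSERT OVERWRITE LOCAL DIRECTORY '/opt/hadoop/hivefarm/mr-result-csv' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM employee;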

Now let’s store the output to HDFS itself.

hive> INSERT OVERWRITE DIRECTORY '/user/hadoop/lab27' SELECT * FROM employee;

hadoop@gandhari:/opt/hadoop-2.6.4/hivefarm$ hadoop fs -ls /user/hadoop/lab27
Found 1 items
-rwxr-xr-x   3 hadoop supergroup        254 2016-10-15 13:05 /user/hadoop/lab27/000000_0

hadoop@gandhari:/opt/hadoop-2.6.4/hivefarm$ hadoop fs -cat /user/hadoop/lab27/000000_0
1 Dharma\N Sr Manager
2 Bheema\N Cook
3 Arjuna\N Instructor
4 Nakula\N Jr Instructor
5 Sahadeva\N Jr Instructor
1 Dharma\N Sr Manager
2 Bheema\N Cook
3 Arjuna\N Instructor
4 Nakula\N Jr Instructor
5 Sahadeva\N Jr Instructor

Drop

Finally, we can drop a table or a database just as in any other SQL dialect. Since employee is a managed table, dropping it also removes its files from the warehouse folder.

hive> drop table employee;
OK
Time taken: 2.405 seconds
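Dropping a database works the same way; a hedged example (CASCADE is required if the database still contains tables):

hive> drop database mydatabase cascade;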

Hadoop Ecosystem Installation – Contents

Here is the list of pages that can help you install Hadoop and its ecosystem products:

Distributed HBASE & ZooKeeper Installation and Configuration

Hue Installation and Configuration

Hive installation & configuration

After "Hadoop – pseudo-distributed mode installation – second time", this is our next job as part of my course. Let's install and test Hive. This is a continuation of the Hadoop installation, so I'll follow the folder structures and usernames given in the previous post.

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.


Download and Install

hadoop@gandhari:/opt/hadoop-2.6.4$ wget http://download.nus.edu.sg/mirror/apache/hive/hive-2.1.0/apache-hive-2.1.0-bin.tar.gz
hadoop@gandhari:/opt/hadoop-2.6.4$ gunzip apache-hive-2.1.0-bin.tar.gz
hadoop@gandhari:/opt/hadoop-2.6.4$ tar -xvf apache-hive-2.1.0-bin.tar
hadoop@gandhari:/opt/hadoop-2.6.4$ ln -s apache-hive-2.1.0-bin/ hive

Setup Environment – .bashrc changes

Make the following changes to the .bashrc file:
#HIVE VARIABLES
export HIVE_HOME=/opt/hadoop/apache-hive-2.1.0-bin
export HIVE_CONF_DIR=$HIVE_HOME/conf
export PATH=$PATH:$HIVE_HOME/bin

Setup Environment – Creating directory structure

hadoop@gandhari:~$ hadoop fs -mkdir /tmp
mkdir: Call From gandhari/192.168.0.169 to gandhari:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

DFS and YARN should be running before setting up Hive.

hadoop@gandhari:~$ start-dfs.sh
hadoop@gandhari:~$ start-yarn.sh
hadoop@gandhari:~$ hadoop fs -mkdir /tmp
mkdir: `/tmp': File exists
hadoop@gandhari:~$ hadoop fs -mkdir /user
hadoop@gandhari:~$ hadoop fs -mkdir /user/hive
hadoop@gandhari:~$ hadoop fs -mkdir /user/hive/warehouse
hadoop@gandhari:~$ hadoop fs -chmod g+w /tmp
hadoop@gandhari:~$ hadoop fs -chmod g+w /user/hive/warehouse

Install MySQL Server

hadoop@gandhari:~$ sudo apt-get install mysql-server
hadoop@gandhari:~$ sudo /etc/init.d/mysql start
[ ok ] Starting mysql (via systemctl): mysql.service.
hadoop@gandhari:~$ sudo apt-get install mysql-client
hadoop@gandhari:~$ sudo apt-get install libmysql-java
hadoop@gandhari:~$ cp /usr/share/java/mysql.jar $HIVE_HOME
hadoop@gandhari:~$ cp /usr/share/java/mysql-connector-java-5.1.38.jar /opt/hadoop/hive/lib/
hadoop@gandhari:~$ /usr/bin/mysql_secure_installation

Creating Hive database

hadoop@gandhari:~/apache-hive-2.1.0-bin$ mysql -u root -p
Enter password:

mysql> CREATE DATABASE metastore;
Query OK, 1 row affected (0.00 sec)
mysql> USE metastore;
Database changed

mysql> SOURCE /opt/hadoop-2.6.4/hive/scripts/metastore/upgrade/mysql/hive-schema-0.12.0.mysql.sql
mysql> CREATE USER hive@gandhari IDENTIFIED BY 'P@ssw0rd';
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM hive@gandhari;
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO hive@gandhari;
mysql> FLUSH PRIVILEGES;
mysql> GRANT ALL ON metastore.* TO 'hive'@'%' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT SELECT,INSERT,UPDATE ON *.* TO 'hive' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL ON *.* TO 'hive'@'127.0.0.1' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL ON *.* TO 'hive'@'localhost' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL ON *.* TO 'hive'@'%' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL ON metastore.* TO 'hive'@'%' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hive'@'gandhari' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hive'@'192.168.0.169' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL privileges ON metastore.* TO 'hive'@'127.0.0.1' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL privileges ON *.* TO 'hive'@'127.0.0.1' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL privileges ON metastore.* TO 'hive'@'127.0.0.1' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL privileges ON *.* TO '%'@'%' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL privileges ON *.* TO '*'@'*' IDENTIFIED BY 'P@ssw0rd';
mysql> FLUSH PRIVILEGES;

Grant all permissions to the Hive user

mysql> GRANT ALL privileges ON metastore.* TO 'hive'@'127.0.0.1' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL privileges ON metastore.* TO 'hive'@'gandhari' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL privileges ON metastore.* TO 'hive'@'%' IDENTIFIED BY 'P@ssw0rd';
mysql> FLUSH PRIVILEGES;
mysql> exit;
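In hindsight, the overlapping grants above boil down to a much smaller set. A hedged minimal version for this single-host setup (same user and password as above):

mysql> GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'localhost' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'%' IDENTIFIED BY 'P@ssw0rd';
mysql> FLUSH PRIVILEGES;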

Creating Hive config file

hadoop@gandhari:~/hive/conf$ cp hive-default.xml.template hive-site.xml

hadoop@gandhari:~/hive/conf$ vi hive-site.xml

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore</value>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>P@ssw0rd</value>
</property>

<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property>

<property>
<name>hive.stats.autogather</name>
<value>false</value>
</property>

<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>

<property>
<name>hive.exec.local.scratchdir</name>
<value>/tmp</value>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/tmp</value>
</property>
<property>
<name>hive.querylog.location</name>
<value>/tmp</value>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/tmp/operation_logs</value>
</property>

We'll be launching Hive shortly. Let's make sure the daemons are running.

hadoop@gandhari:~$ jps
7410 ResourceManager
6931 NameNode
7254 SecondaryNameNode
7046 DataNode
7527 NodeManager
7817 Jps

Creating a demo table and testing

hive> CREATE TABLE demo1 (id int, name string);
OK
Time taken: 1.448 seconds
hive> SHOW TABLES;
OK
demo1
Time taken: 0.195 seconds, Fetched: 1 row(s)
hive> select count(*) from demo1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20160823145925_c4271279-c5c0-4948-a1c3-fb6f79718b5d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2016-08-23 14:59:29,802 Stage-1 map = 0%,  reduce = 100%
Ended Job = job_local1827679072_0001
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
0
Time taken: 4.525 seconds, Fetched: 1 row(s)
hive>

I faced the following issues –

Error while starting Hive – Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D

I got this error while starting Hive for the first time. With reference to "java.net.URISyntaxException when starting HIVE" and the AdminManual Configuration page, I made the following changes to get it working:


<property>
<name>hive.exec.local.scratchdir</name>
<value>/tmp</value>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/tmp</value>
</property>
<property>
<name>hive.querylog.location</name>
<value>/tmp</value>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/tmp/operation_logs</value>
</property>

Error while Starting Hive – DatastoreDriverNotFoundException

Here is a scary exception thrown out when I started Hive.

org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.

I forgot to copy the MySQL driver to the Hive lib folder. Here is the command to copy it:

cp /usr/share/java/mysql-connector-java-5.1.38.jar /opt/hadoop/hive/lib/