Distributed HBASE & ZooKeeper Installation and Configuration

Hi,

I’m happy to share the output of an interesting lab exercise. In this post, let’s install HBase and ZooKeeper and issue some commands in the HBase shell.

HBase is an open source, non-relational, distributed database modeled after Google’s BigTable and is written in Java. It is developed as part of the Apache Software Foundation’s Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop.

Apache ZooKeeper is a software project of the Apache Software Foundation. It is essentially a distributed hierarchical key-value store, which is used to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems. ZooKeeper was a sub-project of Hadoop but is now a top-level project in its own right.

HBASE download and configuration

hadoop@gandhari:/opt/hadoop-2.6.4$ wget https://archive.cloudera.com/cdh5/cdh/5/hbase-1.0.0-cdh5.5.1.tar.gz

hadoop@gandhari:/opt/hadoop-2.6.4$ gunzip hbase-1.0.0-cdh5.5.1.tar.gz

hadoop@gandhari:/opt/hadoop-2.6.4$ tar -xvf hbase-1.0.0-cdh5.5.1.tar

hadoop@gandhari:/opt/hadoop-2.6.4$ ln -s hbase-1.0.0-cdh5.5.1/ hbase

hadoop@gandhari:/opt/hadoop-2.6.4$ mkdir /tmp/hbase

hadoop@gandhari:/opt/hadoop-2.6.4$ cd hbase/conf

Add the following entries to hbase-env.sh:

export HBASE_MANAGES_ZK=false
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
export HADOOP_HOME=/opt/hadoop


hadoop@gandhari:/opt/hadoop-2.6.4/hbase/conf$ vi ~/.bashrc

#HBASE VARIABLES
export HBASE_HOME=/opt/hadoop/hbase
export PATH=$PATH:$HBASE_HOME/bin


hadoop@gandhari:/opt/hadoop-2.6.4/hbase/conf$ source ~/.bashrc

ZooKeeper – Download and Configuration


hadoop@gandhari:/opt/hadoop-2.6.4/hbase/conf$ cd $HOME
hadoop@gandhari:~$ pwd
/opt/hadoop

hadoop@gandhari:~$ wget https://archive.cloudera.com/cdh5/cdh/5/zookeeper-3.4.5-cdh5.5.1.tar.gz

hadoop@gandhari:~$ gunzip zookeeper-3.4.5-cdh5.5.1.tar.gz

hadoop@gandhari:~$ tar -xvf zookeeper-3.4.5-cdh5.5.1.tar

hadoop@gandhari:~$ ln -s zookeeper-3.4.5-cdh5.5.1/ zookeeper

hadoop@gandhari:~$ cd zookeeper
hadoop@gandhari:~/zookeeper$ mkdir zookeeper

hadoop@gandhari:~/zookeeper$ cd conf

hadoop@gandhari:~/zookeeper/conf$ cp zoo_sample.cfg zoo.cfg

hadoop@gandhari:~/zookeeper/conf$ vi zoo.cfg

Add the following entries to zoo.cfg. Note that zoo.cfg does not expand shell variables, so the data directory is given as an absolute path:

dataDir=/opt/hadoop/zookeeper/zookeeper
server.0=gandhari:2888:3888


hadoop@gandhari:~/zookeeper/conf$ cp zoo.cfg /opt/hadoop/hbase/conf/

This carries the ZooKeeper configuration over to HBase as well.

Create a myid file in ZooKeeper’s dataDir folder with the entry 0, which denotes the server instance number.

hadoop@gandhari:~/zookeeper$ touch myid

hadoop@gandhari:~/zookeeper$ echo '0'> /opt/hadoop/zookeeper/zookeeper/myid

hadoop@gandhari:/etc/hadoop/conf$ cd /opt/hadoop/etc/hadoop/

hadoop@gandhari:~/etc/hadoop$ cp core-site.xml /opt/hadoop/hbase/conf/
hadoop@gandhari:~/etc/hadoop$ cp hdfs-site.xml /opt/hadoop/hbase/conf/
hadoop@gandhari:~/etc/hadoop$ cp yarn-site.xml /opt/hadoop/hbase/conf/
hadoop@gandhari:~/etc/hadoop$ cp mapred-site.xml.template /opt/hadoop/hbase/conf/mapred-site.xml

Reconfigure HBASE with ZooKeeper

hadoop@gandhari:~/etc/hadoop$ cd $HOME
hadoop@gandhari:~$ cd hbase/conf/

hadoop@gandhari:~/hbase/conf$ vi hbase-site.xml

<configuration>
 <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed Zookeeper
      true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>
 <property>
    <name>hbase.regionserver.hlog.replication</name>
    <value>1</value>
  </property>
 <property>
    <name>hbase.tmp.dir</name>
    <value>/tmp/hbase</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://gandhari:9000/hbase</value>
  </property>
 <property>
  <name>hbase.zookeeper.quorum</name>
  <value>gandhari:2181</value>
 </property>
 <property>
  <name>zookeeper.session.timeout</name>
  <value>15000</value>
 </property>
 <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/hadoop/zookeeper/zookeeper</value>
    <description>Property from ZooKeeper config zoo.cfg.
    The directory where the snapshot is stored.
    </description>
  </property>
</configuration>

Let’s start ZooKeeper.

hadoop@gandhari:~/hbase/conf$ cd /opt/hadoop/zookeeper/bin

hadoop@gandhari:~/zookeeper/bin$ ./zkServer.sh start
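
ZooKeeper should now be listening on its client port (2181, from zoo_sample.cfg). A quick sanity check, as a rough sketch (the second probe assumes nc is available):

$ ./zkServer.sh status
$ echo ruok | nc gandhari 2181

The first command reports the server mode, and the ruok probe should answer imok.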

Now let’s start HBase.

hadoop@gandhari:~/zookeeper/bin$ cd /opt/hadoop/hbase/bin/

hadoop@gandhari:~/hbase/bin$ ./hbase-daemon.sh start master

hadoop@gandhari:~/hbase/bin$ hbase-daemon.sh start regionserver

Let’s start the HBase shell.

hadoop@gandhari:~/hbase/bin$ ./hbase shell

hbase(main):001:0> status

ERROR: Can't get master address from ZooKeeper; znode data == null

This error indicates that the Hadoop daemons are not running. Make sure all the servers have been started:

$ jps

10693 NodeManager
 10229 DataNode
 10086 NameNode
 11254 HRegionServer
 10936 JobHistoryServer
 10569 ResourceManager
 11131 HMaster
 10411 SecondaryNameNode
 11356 Jps
 11070 QuorumPeerMain

hbase(main):001:0> status
 1 servers, 0 dead, 2.0000 average load

hbase(main):002:0> status 'simple'
 1 live servers
 gandhari:60020 1472360461080
 requestsPerSecond=0.0, numberOfOnlineRegions=2, usedHeapMB=26, maxHeapMB=1958, numberOfStores=2, numberOfStorefiles=0, storefileUncompressedSizeMB=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0, readRequestsCount=8, writeRequestsCount=5, rootIndexSizeKB=0, totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0, currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
 0 dead servers
 Aggregate load: 0, regions: 2

hbase(main):003:0> status 'summary'
 1 servers, 0 dead, 2.0000 average load

hbase(main):005:0> status 'detailed'
 version 1.0.0-cdh5.5.1
 0 regionsInTransition
 master coprocessors: []
 1 live servers
 gandhari:60020 1472360461080
 requestsPerSecond=0.0, numberOfOnlineRegions=2, usedHeapMB=26, maxHeapMB=1958, numberOfStores=2, numberOfStorefiles=0, storefileUncompressedSizeMB=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0, readRequestsCount=8, writeRequestsCount=5, rootIndexSizeKB=0, totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0, currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
 "hbase:meta,,1"
 numberOfStores=1, numberOfStorefiles=0, storefileUncompressedSizeMB=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0, readRequestsCount=2, writeRequestsCount=3, rootIndexSizeKB=0, totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0, currentCompactedKVs=0, compactionProgressPct=NaN, completeSequenceId=-1, dataLocality=0.0
 "hbase:namespace,,1472360489768.21310113a36cdc875d33fdac0b6060fd."
 numberOfStores=1, numberOfStorefiles=0, storefileUncompressedSizeMB=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0, readRequestsCount=6, writeRequestsCount=2, rootIndexSizeKB=0, totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0, currentCompactedKVs=0, compactionProgressPct=NaN, completeSequenceId=-1, dataLocality=0.0
 0 dead servers

hbase(main):006:0> list
 TABLE
 0 row(s) in 0.0400 seconds

=> []
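
Beyond status, a few basic table operations round out the smoke test. This is just an illustrative sketch; the table name demo and column family cf are arbitrary:

create 'demo', 'cf'
put 'demo', 'row1', 'cf:greeting', 'hello hbase'
scan 'demo'
get 'demo', 'row1'
disable 'demo'
drop 'demo'

scan should return the single row written by put, and a table must be disabled before it can be dropped.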

 

 

 

 


Flume Installation and Configuration

Hi,


Here is another exercise from my course.


Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

Download and Extract

hadoop@gandhari:/opt/hadoop-2.6.4$ wget https://repository.cloudera.com/artifactory/public/org/apache/flume/flume-ng-dist/1.6.0-cdh5.5.1/flume-ng-dist-1.6.0-cdh5.5.1-bin.tar.gz

hadoop@gandhari:/opt/hadoop-2.6.4$ gunzip flume-ng-dist-1.6.0-cdh5.5.1-bin.tar.gz

hadoop@gandhari:/opt/hadoop-2.6.4$ tar -xvf flume-ng-dist-1.6.0-cdh5.5.1-bin.tar

 hadoop@gandhari:~$ ln -s apache-flume-1.6.0-cdh5.5.1-bin/ flume

hadoop@gandhari:~$ vi .bashrc

#FLUME VARIABLES
export FLUME_HOME=/opt/hadoop/flume
export PATH=$PATH:$FLUME_HOME/bin
export FLUME_CONF_DIR=/etc/hadoop/conf
export FLUME_CLASSPATH=/etc/hadoop/conf

hadoop@gandhari:~$ source .bashrc

Flume setup

hadoop@gandhari:~$ cd flume

hadoop@gandhari:~/flume$ mkdir logs

hadoop@gandhari:~/flume$ cd conf/

hadoop@gandhari:~/flume/conf$ cp flume-conf.properties.template flume.conf


hadoop@gandhari:~/flume/conf$ vi flume.conf

agent.sources = avroSrc
agent.channels = memoryChannel
agent.sinks = loggerSink hdfs-sink

# For each one of the sources, the type is defined
agent.sources.avroSrc.type = exec
agent.sources.avroSrc.port = 3631
agent.sources.avroSrc.threads = 2
agent.sources.avroSrc.bind=0.0.0.0
agent.sources.avroSrc.command = tail -f /opt/hadoop/logs/test.log

# The channel can be defined as follows.
agent.sources.avroSrc.channels = memoryChannel

# Each sink's type must be defined
agent.sinks.loggerSink.type = logger

#Specify the channel the sink should use
agent.sinks.loggerSink.channel = memoryChannel
agent.sinks.hdfs-sink.hdfs.path=hdfs://gandhari:9000/test/flume
agent.sinks.hdfs-sink.type=hdfs
agent.sinks.hdfs-sink.channel=memoryChannel
agent.sinks.hdfs-sink.hdfs.fileType=DataStream
agent.sinks.hdfs-sink.hdfs.rollInterval=1
agent.sinks.hdfs-sink.hdfs.writeFormat=Text

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory

# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 100

Execution

hadoop@gandhari:~/flume/bin$ flume-ng agent --name agent --conf-file ../conf/flume.conf -Dflume.root.logger=DEBUG,console >> /opt/hadoop/logs/test.log
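
Assuming the agent starts cleanly, appending lines to the tailed file should produce events in HDFS. A rough way to verify (FlumeData is the HDFS sink's default file prefix):

$ echo "flume test event" >> /opt/hadoop/logs/test.log
$ hadoop fs -ls /test/flume
$ hadoop fs -cat /test/flume/FlumeData.*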

 

 

Oozie Installation and Configuration

Hi,

Here is the output of my latest lab exercise: Oozie installation. This process was not straightforward, as the steps I was given didn’t work as expected. I had to manually tune the SQL to make it work.

Apache Oozie is a server-based workflow scheduling system to manage Hadoop jobs.

Workflows in Oozie are defined as a collection of control flow and action nodes in a directed acyclic graph. Control flow nodes define the beginning and the end of a workflow (start, end and failure nodes) as well as a mechanism to control the workflow execution path (decision, fork and join nodes). Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing task. Oozie provides support for different types of actions including Hadoop MapReduce, Hadoop distributed file system operations, Pig, SSH, and email. Oozie can also be extended to support additional types of actions.

Here are the steps.

Download and extract

hadoop@gandhari:~$ wget http://archive.cloudera.com/cdh5/cdh/5/oozie-4.0.0-cdh5.1.0.tar.gz

hadoop@gandhari:~$ gunzip oozie-4.0.0-cdh5.1.0.tar.gz

hadoop@gandhari:~$ tar -xvf oozie-4.0.0-cdh5.1.0.tar

hadoop@gandhari:~$ ln -s oozie-4.0.0-cdh5.1.0/ oozie

hadoop@gandhari:~$ ls oozie
 bin             examples     README.txt          webapp
 builds          hadooplibs   release-log.txt     workflowgenerator
 client          LICENSE.txt  sharelib            work.log
 core            login        source-headers.txt  zookeeper-security-tests
 DISCLAIMER.txt  minitest     src
 distro          NOTICE.txt   tools
 docs            pom.xml      utils

Setting up the Oozie MySQL user

hadoop@gandhari:~$ mysql -u root -p

mysql> CREATE DATABASE oozie;

mysql> USE oozie;

mysql> CREATE USER 'oozie' IDENTIFIED BY 'P@ssw0rd';

mysql> GRANT SELECT,INSERT,UPDATE,DELETE ON *.* TO 'oozie';

mysql> GRANT ALL ON *.* TO 'oozie'@'gandhari' IDENTIFIED BY 'P@ssw0rd';

mysql> GRANT ALL ON *.* TO 'oozie'@'%' IDENTIFIED BY 'P@ssw0rd';

mysql> GRANT ALL ON oozie.* TO 'oozie'@'%' IDENTIFIED BY 'P@ssw0rd';

mysql> GRANT ALL privileges ON *.* TO 'oozie'@'gandhari' IDENTIFIED BY 'P@ssw0rd';

mysql> GRANT ALL privileges ON *.* TO 'oozie'@'192.168.0.169' IDENTIFIED BY 'P@ssw0rd';

mysql> GRANT ALL privileges ON oozie.* TO 'oozie'@'192.168.0.169' IDENTIFIED BY 'P@ssw0rd';

mysql> GRANT ALL privileges ON *.* TO 'oozie'@'127.0.0.1' IDENTIFIED BY 'P@ssw0rd';

mysql> GRANT ALL privileges ON oozie.* TO 'oozie'@'127.0.0.1' IDENTIFIED BY 'P@ssw0rd';

mysql> GRANT ALL privileges ON oozie.* TO 'oozie'@'gandhari' IDENTIFIED BY 'P@ssw0rd';

mysql> GRANT ALL privileges ON oozie.* TO 'oozie'@'%' IDENTIFIED BY 'P@ssw0rd';
 Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> GRANT ALL privileges ON *.* TO '%'@'%' IDENTIFIED BY 'P@ssw0rd';

mysql> GRANT ALL privileges ON *.* TO '*'@'*' IDENTIFIED BY 'P@ssw0rd';

mysql> FLUSH PRIVILEGES;

mysql> exit
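
Before going further, it’s worth confirming the new account can actually log in; a quick check against one of the granted hosts (127.0.0.1 here):

$ mysql -u oozie -p -h 127.0.0.1 oozie -e "SELECT 1;"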

Oozie portal settings

hadoop@gandhari:~$ pwd
 /opt/hadoop

hadoop@gandhari:~$ cd etc/hadoop/


hadoop@gandhari:~/etc/hadoop$ vi core-site.xml

#OOZIE
 <property>
 <name>hadoop.proxyuser.oozie.hosts</name>
 <value>*</value>
 </property>
 <property>
 <name>hadoop.proxyuser.oozie.groups</name>
 <value>*</value>
 </property>
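
The proxyuser settings are only picked up after the Hadoop daemons are restarted; a sketch using the standard sbin scripts:

$ stop-yarn.sh && stop-dfs.sh
$ start-dfs.sh && start-yarn.sh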

Creating Oozie database

hadoop@gandhari:~/oozie$ cd /opt/hadoop

hadoop@gandhari:~$ cd oozie/bin

hadoop@gandhari:~/oozie/bin$ ./ooziedb.sh create -run
 setting CATALINA_OPTS="$CATALINA_OPTS -Xmx1024m"

Validate DB Connection
 DONE
 Check DB schema does not exist
 DONE
 Check OOZIE_SYS table does not exist
 DONE
 Create SQL schema
 DONE
 Create OOZIE_SYS table
 DONE

Oozie DB has been created for Oozie version '4.0.0-cdh5.1.0'

The SQL commands have been written to: /tmp/ooziedb-5275812012387848818.sql

This process had many errors. The default values generated by this script were faulty; I had to change them all to CURRENT_TIMESTAMP in the generated SQL to make it work.

ExtJS

The ExtJS library is not bundled with Oozie due to license restrictions, so we need to add it separately.

hadoop@gandhari:~/oozie/bin$ mkdir /opt/hadoop/extjs

hadoop@gandhari:~/oozie/bin$ cd /opt/hadoop/extjs

hadoop@gandhari:~/extjs$ wget http://archive.cloudera.com/gplextras/misc/ext-2.2.zip

Setting up Oozie portal

Let’s build the war file first.

hadoop@gandhari:~/extjs$ cd /opt/hadoop/oozie/bin/

hadoop@gandhari:~/oozie/bin$ ./addtowar.sh -inputwar ../oozie.war -outputwar ../oozieout.war -extjs /opt/hadoop/extjs/ext-2.2.zip -hadoopJarsSNAPSHOT ../oozie-hadooplibs-4.0.0-cdh5.1.0.tar.gz -hadoop 2.6.4 $HADOOP_HOME ../oozie-sharelib-4.0.0-cdh5.1.0-yarn.tar.gz -jars /opt/hadoop/hive/lib/mysql-connector-java-5.1.38.jar

...

New Oozie WAR file with added 'Hadoop JARs, ExtJS library, JARs' at ../oozieout.war

hadoop@gandhari:~/oozie/bin$ cd ..
 hadoop@gandhari:~/oozie$ ls *.war
 oozieout.war  oozie.war

Let’s copy the war file to Oozie’s Tomcat webapps folder. The MySQL JDBC driver is needed to connect to the Oozie database.

hadoop@gandhari:~/oozie$ cp oozie.war /opt/hadoop/oozie/oozie-server/webapps/

hadoop@gandhari:~/oozie$ cp /opt/hadoop/hive/lib/mysql-connector-java-5.1.38.jar /opt/hadoop/oozie/lib

hadoop@gandhari:~/oozie$ cp /opt/hadoop/hive/lib/mysql-connector-java-5.1.38.jar /opt/hadoop/oozie/libtools/

hadoop@gandhari:~/oozie$ vi conf/oozie-site.xml

<property>
 <name>oozie.service.JPAService.jdbc.driver</name>
 <value>com.mysql.jdbc.Driver</value>
 <description>
 JDBC driver class.
 </description>
 </property>

<property>
 <name>oozie.service.JPAService.jdbc.url</name>
 <value>jdbc:mysql://gandhari:3306/oozie</value>
 <description>
 JDBC URL.
 </description>
 </property>

<property>
 <name>oozie.service.JPAService.jdbc.username</name>
 <value>oozie</value>
 <description>
 DB user name.
 </description>
 </property>
 <property>
 <name>oozie.service.JPAService.jdbc.password</name>
 <value>P@ssw0rd</value>
 <description>
 DB user password.

IMPORTANT: if password is empty leave a 1 space string, the service trims the value,
 if empty Configuration assumes it is NULL.
 </description>
 </property>

ShareLib update

Let’s update the sharelib folder.

hadoop@gandhari:~/oozie$ bin/oozie-setup.sh sharelib create -fs hdfs://gandhari:9000 -locallib oozie-sharelib-4.0.0-cdh5.1.0-yarn.tar.gz -locallib oozie-hadooplibs-4.0.0-cdh5.1.0.tar.gz

....

the destination path for sharelib is: /user/hadoop/share/lib/lib_20160826174043
hadoop@gandhari:~/oozie$ bin/oozied.sh start

hadoop@gandhari:~/oozie$ bin/oozie admin -oozie http://gandhari:11000/oozie -sharelibupdate hdfs://gandhari:9000/user/hadoop/share/lib/lib_20160826174043
 null

hadoop@gandhari:~/oozie$ bin/oozie admin -shareliblist -oozie http://gandhari:11000/oozie
 [Available ShareLib]
 hive
 distcp
 mapreduce-streaming
 oozie
 hcatalog
 hive2
 sqoop
 pig

hadoop@gandhari:~/oozie$ bin/oozied.sh stop

hadoop@gandhari:~/oozie$ bin/oozied.sh start
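
Before opening the web console, the server status can be checked from the command line; it should report the system mode as NORMAL:

$ bin/oozie admin -status -oozie http://gandhari:11000/oozie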

Point your browser to http://gandhari:11000/oozie/ to get to the console.

hadoop008 - oozie

 

Sqoop Installation and Configuration


Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free form SQL query as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase. Exports can be used to put data from Hadoop into a relational database. Sqoop got the name from sql+hadoop. Sqoop became a top-level Apache project in March 2012.

Download and Extract

hadoop@gandhari:~$ wget http://download.nus.edu.sg/mirror/apache/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz

hadoop@gandhari:~$ gunzip sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz

hadoop@gandhari:~$ tar -xvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar

hadoop@gandhari:~$ ln -s sqoop-1.4.6.bin__hadoop-2.0.4-alpha/ sqoop

.bashrc and other environment changes

Add the following to ~/.bashrc:

#SQOOP VARIABLES
export SQOOP_HOME=/opt/hadoop/sqoop
export PATH=$PATH:$SQOOP_HOME/bin

hadoop@gandhari:~$ source ~/.bashrc

Sqoop config

hadoop@gandhari:~$ cd sqoop
hadoop@gandhari:~/sqoop$ cd conf/
hadoop@gandhari:~/sqoop/conf$ ls
oraoop-site-template.xml  sqoop-env-template.sh    sqoop-site.xml
sqoop-env-template.cmd    sqoop-site-template.xml

hadoop@gandhari:~/sqoop/conf$ cp sqoop-env-template.sh sqoop-env.sh

Add the following to sqoop-env.sh:

export HADOOP_COMMON_HOME=/opt/hadoop
export HADOOP_MAPRED_HOME=/opt/hadoop

hadoop@gandhari:~/sqoop/conf$ cp /usr/share/java/mysql-connector-java-5.1.38.jar /opt/hadoop/sqoop/lib/

Execution

hadoop@gandhari:~/sqoop/conf$ cd ..
hadoop@gandhari:~/sqoop$ sqoop-version
Warning: /opt/hadoop/sqoop/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /opt/hadoop/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/hadoop/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /opt/hadoop/sqoop/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
16/08/24 15:24:09 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Sqoop 1.4.6
git commit id c0c5a81723759fa575844a0a1eae8f510fa32c25
Compiled by root on Mon Apr 27 14:38:36 CST 2015
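
As a quick smoke test, Sqoop can list the databases on the MySQL server through the connector we just copied into lib/. This is a sketch using the oozie account created earlier; substitute any valid MySQL user:

$ sqoop list-databases --connect jdbc:mysql://gandhari:3306/ --username oozie -P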

 

 

 

Hive installation & configuration

After the Hadoop pseudo-distributed mode installation (second time), this is our next ICT task as part of my course. Let’s install and test Hive. This is a continuation of the Hadoop installation, so I’ll follow the folder structure and usernames given in the previous post.

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.


Download and Install

hadoop@gandhari:/opt/hadoop-2.6.4$ wget http://download.nus.edu.sg/mirror/apache/hive/hive-2.1.0/apache-hive-2.1.0-bin.tar.gz
hadoop@gandhari:/opt/hadoop-2.6.4$ gunzip apache-hive-2.1.0-bin.tar.gz
hadoop@gandhari:/opt/hadoop-2.6.4$ tar -xvf apache-hive-2.1.0-bin.tar
hadoop@gandhari:/opt/hadoop-2.6.4$ ln -s apache-hive-2.1.0-bin/ hive

Setup Environment – .bashrc changes

Make the following changes to the .bashrc file:
#HIVE VARIABLES
export HIVE_HOME=/opt/hadoop/apache-hive-2.1.0-bin
export HIVE_CONF_DIR=$HIVE_HOME/conf
export PATH=$PATH:$HIVE_HOME/bin

Setup Environment – Creating directory structure

hadoop@gandhari:~$ hadoop fs -mkdir /tmp
mkdir: Call From gandhari/192.168.0.169 to gandhari:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

DFS and YARN should be running before setting up Hive.

hadoop@gandhari:~$ start-dfs.sh
hadoop@gandhari:~$ start-yarn.sh
hadoop@gandhari:~$ hadoop fs -mkdir /tmp
mkdir: `/tmp': File exists
hadoop@gandhari:~$ hadoop fs -mkdir /user
hadoop@gandhari:~$ hadoop fs -mkdir /user/hive
hadoop@gandhari:~$ hadoop fs -mkdir /user/hive/warehouse
hadoop@gandhari:~$ hadoop fs -chmod g+w /tmp
hadoop@gandhari:~$ hadoop fs -chmod g+w /user/hive/warehouse

Install MySQL Server

hadoop@gandhari:~$ sudo apt-get install mysql-server
hadoop@gandhari:~$ sudo /etc/init.d/mysql start
[ ok ] Starting mysql (via systemctl): mysql.service.
hadoop@gandhari:~$ sudo apt-get install mysql-client
hadoop@gandhari:~$ sudo apt-get install libmysql-java
hadoop@gandhari:~$ cp /usr/share/java/mysql.jar $HIVE_HOME
hadoop@gandhari:~$ cp /usr/share/java/mysql-connector-java-5.1.38.jar /opt/hadoop/hive/lib/
hadoop@gandhari:~$ /usr/bin/mysql_secure_installation

Creating the Hive database

hadoop@gandhari:~/apache-hive-2.1.0-bin$ mysql -u root -p
Enter password:

mysql> CREATE DATABASE metastore;
Query OK, 1 row affected (0.00 sec)
mysql> USE metastore;
Database changed

mysql> SOURCE /opt/hadoop-2.6.4/hive/scripts/metastore/upgrade/mysql/hive-schema-0.12.0.mysql.sql
mysql> CREATE USER hive@gandhari IDENTIFIED BY 'P@ssw0rd';
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM hive@gandhari;
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO hive@gandhari;
mysql> FLUSH PRIVILEGES;
mysql> GRANT ALL ON metastore.* TO 'hive'@'%' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT SELECT,INSERT,UPDATE,SELECT ON *.* TO 'hive' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL ON *.* TO 'hive'@'127.0.0.1' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL ON *.* TO 'hive'@'localhost' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL ON *.* TO 'hive'@'%' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL ON metastore.* TO 'hive'@'%' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hive'@'gandhari' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hive'@'192.168.0.169' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL privileges ON metastore.* TO 'hive'@'127.0.0.1' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL privileges ON *.* TO 'hive'@'127.0.0.1' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL privileges ON metastore.* TO 'hive'@'127.0.0.1' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL privileges ON *.* TO '%'@'%' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL privileges ON *.* TO '*'@'*' IDENTIFIED BY 'P@ssw0rd';
mysql> FLUSH PRIVILEGES;

Grant all permissions to Hive user

mysql> GRANT ALL privileges ON metastore.* TO 'hive'@'127.0.0.1' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL privileges ON metastore.* TO 'hive'@'gandhari' IDENTIFIED BY 'P@ssw0rd';
mysql> GRANT ALL privileges ON metastore.* TO 'hive'@'%' IDENTIFIED BY 'P@ssw0rd';
mysql> FLUSH PRIVILEGES;
mysql> exit;
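
A quick way to confirm that both the grants and the schema import worked is to list the metastore tables as the hive user:

$ mysql -u hive -p -h 127.0.0.1 metastore -e "SHOW TABLES;"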

Creating Hive config file

hadoop@gandhari:~/hive/conf$ cp hive-default.xml.template hive-site.xml

hadoop@gandhari:~/hive/conf$ vi hive-site.xml

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore</value>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>P@ssw0rd</value>
</property>

<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property>

<property>
<name>hive.stats.autogather</name>
<value>false</value>
</property>

<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>

<property>
<name>hive.exec.local.scratchdir</name>
<value>/tmp</value>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/tmp</value>
</property>
<property>
<name>hive.querylog.location</name>
<value>/tmp</value>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/tmp/operation_logs</value>
</property>

We’ll be launching Hive shortly. Let’s make sure the daemons are running:

hadoop@gandhari:~$ jps
7410 ResourceManager
6931 NameNode
7254 SecondaryNameNode
7046 DataNode
7527 NodeManager
7817 Jps

Creating a demo table and testing it
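
Launch the Hive CLI first; it connects to the MySQL metastore configured above:

$ hive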

hive> CREATE TABLE demo1 (id int, name string);
OK
Time taken: 1.448 seconds
hive> SHOW TABLES;
OK
demo1
Time taken: 0.195 seconds, Fetched: 1 row(s)
hive> select count(*) from demo1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20160823145925_c4271279-c5c0-4948-a1c3-fb6f79718b5d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2016-08-23 14:59:29,802 Stage-1 map = 0%,  reduce = 100%
Ended Job = job_local1827679072_0001
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
0
Time taken: 4.525 seconds, Fetched: 1 row(s)
hive>
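
The empty count is expected since nothing has been loaded yet. As a further check, rows can be inserted and read back from the shell; a minimal sketch (Hive 2.x supports INSERT ... VALUES, which runs a small job behind the scenes):

INSERT INTO TABLE demo1 VALUES (1, 'neelam');
SELECT * FROM demo1;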

I faced the following issues –

Hadoop – pseudo-distributed mode installation – second time

I had been waiting for a computing machine for Hadoop. Unfortunately, I couldn’t get one for the past two months due to multiple commitments. Kannan and I visited one of the local computer stores two weeks ago. I selected a Dell tower desktop, but with the desired configuration (i7/16 GB RAM/500 GB) it went over my budget.

I lost hope and postponed the plan. Then I got an older-model laptop with a high configuration at a local expo. It doesn’t have modern features like a touch screen or an SSD, but I’m okay with that. I named it after Jeyamohan’s novel on Krishna: Neelam! (Neelam = Blue)

 

Here are the steps I followed to create the Hadoop environment. This is more precise than my earlier post.

Here is the summary of the Hadoop pseudo-distributed mode installation. This is my second post regarding the environment setup.

System Specs

  • OS: Ubuntu 64 bit/VMware Workstation Player
  • RAM: 8 GB
  • CPU: 4
  • Java: 1.8
  • Hadoop: 2.6

Update Ubuntu

Let’s update Ubuntu first before starting the process. This may take a while, depending on how recently you last updated.

The following command will update the package definitions.

pandian@kunthi:~$ sudo apt-get update
...
...
Fetched 1,646 kB in 8s (204 kB/s)
AppStream cache update completed, but some metadata was ignored due to errors.
Reading package lists... Done

The following command will upgrade the installed packages:

pandian@kunthi:~$ sudo apt-get dist-upgrade
...
...
355 upgraded, 5 newly installed, 0 to remove and 0 not upgraded.
Need to get 295 MB/465 MB of archives.
After this operation, 279 MB of additional disk space will be used.
Do you want to continue? [Y/n] Y
...
...

<It is time consuming. Take a break.>

Installing JDK

With reference to http://askubuntu.com/questions/521145/how-to-install-oracle-java-on-ubuntu-14-04, follow the instructions below to install the JDK:

pandian@kunthi:~$ sudo apt-add-repository ppa:webupd8team/java
pandian@kunthi:~$ sudo apt-get update
pandian@kunthi:~$ sudo apt-get install oracle-java8-installer
pandian@kunthi:~$ java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) Client VM (build 25.101-b13, mixed mode)
pandian@kunthi:~$ whereis java
java: /usr/bin/java /usr/share/java /usr/share/man/man1/java.1.gz

Create User and User Group

Let’s run Hadoop with its own user and user group.

pandian@kunthi:~$ sudo groupadd -g 599 hadoop
pandian@kunthi:~$ sudo useradd -u 599 -g 599 hadoop

Directory structure

Let’s create the directory structure

pandian@kunthi:~$ sudo mkdir -p /var/lib/hadoop/journaldata
pandian@kunthi:~$ sudo chown hadoop:hadoop -R /var/lib/hadoop/journaldata

User access and sudo privilege

We are still doing Linux tasks; we haven’t touched the Hadoop part yet.

pandian@kunthi:~$ sudo passwd hadoop
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
pandian@kunthi:/opt/software/hadoop$ sudo su
root@kunthi:/home/pandian# cp /etc/sudoers /etc/sudoers.20160820
root@kunthi:~# vi /etc/sudoers

I added the hadoop line shown below to the user privilege specification.

# User privilege specification
root ALL=(ALL:ALL) ALL
hadoop ALL=(ALL:ALL) ALL

root@kunthi:~# cd /opt

root@kunthi:~# wget http://download.nus.edu.sg/mirror/apache/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz

root@kunthi:~# gunzip hadoop-2.6.4.tar.gz

root@kunthi:~# tar -xvf hadoop-2.6.4.tar
root@gandhari:/opt# ln -s /opt/hadoop-2.6.4 hadoop
root@gandhari:/opt# chown hadoop:hadoop hadoop
root@gandhari:/opt# chown hadoop:hadoop -R hadoop-2.6.4
root@gandhari:/opt# usermod -d /opt/hadoop hadoop

root@kunthi:~# exit
pandian@kunthi:~$ su - hadoop
$ pwd
/opt/hadoop
$ bash
hadoop@kunthi:~$ id
uid=1001(hadoop) gid=599(hadoop) groups=599(hadoop)

Hadoop

Let’s create the configuration directory for Hadoop:
hadoop@kunthi:~$ sudo mkdir -p /etc/hadoop/conf
Create a softlink for the conf folder:
hadoop@kunthi:~$ sudo ln -s /opt/hadoop/hadoop-2.6.4/etc/hadoop/** /etc/hadoop/conf/

SSH Keys creation

Hadoop needs key-based SSH login.
hadoop@kunthi:~$ mkdir ~/.ssh
hadoop@kunthi:~$ cd ~/.ssh/
hadoop@kunthi:~/.ssh$ touch authorized_keys
hadoop@kunthi:~/.ssh$ touch known_hosts
hadoop@kunthi:~/.ssh$ chmod 700 ~/.ssh/ && chmod 600 ~/.ssh/*
hadoop@gandhari:/opt/hadoop-2.6.4$ ssh gandhari
The authenticity of host 'gandhari (192.168.0.169)' can't be established.
ECDSA key fingerprint is SHA256:Y/ed5Le/5xqY1ImoVZBsSF7irydJRUn2TNwPBow4uSA.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'gandhari,192.168.0.169' (ECDSA) to the list of known hosts.
hadoop@gandhari's password:
Welcome to Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-34-generic x86_64)

Bash profile – Environmental variables

As I created the home folder of this unix user manually, I need to create the bash profile myself. I’ll copy a working bash profile from another user:
hadoop@kunthi:~$ sudo cp /home/pandian/.bash* .
I’ll add the following environment variables to .bashrc:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
export HADOOP_HOME=/opt/hadoop/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_PREFIX=$HADOOP_HOME
export JAVA_HOME HADOOP_HOME HADOOP_MAPRED_HOME HADOOP_COMMON_HOME HADOOP_HDFS_HOME PATH HADOOP_LOG_DIR

Let’s apply the changes to the current session:
hadoop@kunthi:~$ source ~/.bashrc

Hadoop env config

Let’s specify JAVA_HOME
hadoop@kunthi:~/hadoop/etc/hadoop$ cd $HADOOP_HOME/etc/hadoop/
hadoop@kunthi:~/hadoop/etc/hadoop$ cp hadoop-env.sh hadoop-env.sh.20160821

I made the following changes to hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-oracle/

Setup passwordless ssh login

hadoop@kunthi:~/hadoop$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/opt/hadoop/.ssh/id_rsa):
Your identification has been saved in /opt/hadoop/.ssh/id_rsa.
Your public key has been saved in /opt/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:UXGO3tnfK9K8DayD0/jc+T/WgZetCHOuBAcssUw3gBo hadoop@kunthi
The key's randomart image is:
+---[RSA 2048]----+
| .+.o o.. |
| E .o = o + |
| o + + . . |
| . . + . o |
| S o o o o|
| oo o. =o|
| ==o+..=|
| =.+=+=oo|
| +=o+=++|
+----[SHA256]-----+
hadoop@kunthi:~/hadoop$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
hadoop@kunthi:~/hadoop$ sudo /etc/init.d/ssh restart
[ ok ] Restarting ssh (via systemctl): ssh.service.
hadoop@kunthi:~/hadoop$ ssh hadoop@gandhari
Welcome to Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-34-generic x86_64)

Temp folders for Hadoop

hadoop@gandhari:/opt/hadoop-2.6.4$ sudo mkdir -p /var/lib/hadoop/cache
hadoop@gandhari:/opt/hadoop-2.6.4$ sudo mkdir -p /var/lib/hadoop/cache/hadoop/dfs/data
hadoop@gandhari:/opt/hadoop-2.6.4$ sudo chown hadoop:hadoop /var/lib/hadoop/cache
hadoop@gandhari:/opt/hadoop-2.6.4$ sudo chmod 750 /var/lib/hadoop/cache/
hadoop@gandhari:/opt/hadoop-2.6.4$ sudo mkdir -p /var/lib/hadoop/cache/hadoop/dfs/name
hadoop@gandhari:/opt/hadoop-2.6.4$ sudo chown hadoop:hadoop /var/lib/hadoop/cache/hadoop/dfs/data
hadoop@gandhari:/opt/hadoop-2.6.4$ sudo chown hadoop:hadoop /var/lib/hadoop/cache/hadoop/dfs/name/
hadoop@gandhari:/opt/hadoop-2.6.4$ sudo mkdir -p /var/lib/hadoop/cache/hadoop/dfs

$ sudo mkdir -p /var/lib/hadoop/cache/hadoop/dfs/namesecondary

$ sudo mkdir -p /var/lib/hadoop/cache/hadoop/dfs/data

hadoop@gandhari:/opt/hadoop-2.6.4$ sudo mkdir -p /var/lib/hadoop/cache/hadoop/mapred/local
hadoop@gandhari:/opt/hadoop-2.6.4$ sudo chown hadoop:hadoop /var/lib/hadoop/cache
hadoop@gandhari:/opt/hadoop-2.6.4$ sudo chown hadoop:hadoop /var/lib/hadoop/cache/hadoop/dfs/name
hadoop@gandhari:/opt/hadoop-2.6.4$ sudo chown hadoop:hadoop /var/lib/hadoop/cache/hadoop/mapred/local/
hadoop@gandhari:/opt/hadoop-2.6.4$ sudo chown hadoop:hadoop /var/lib/hadoop/cache/hadoop/dfs/namesecondary/

hadoop@gandhari:/opt/hadoop-2.6.4$ sudo chown hadoop:hadoop -R /etc/hadoop/

Define the slave name

Add the slave hostname. After the change, this is the content of the slaves file; it is the same as my hostname.
hadoop@kunthi:~/hadoop$ cat /etc/hadoop/conf/slaves
kunthi

core-site.xml

Make the appropriate changes to core-site.xml:

hadoop@kunthi:~/hadoop$ cat etc/hadoop/core-site.xml
<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://gandhari:9000</value>
        </property>
</configuration>

hadoop executable

Check whether the hadoop command is working. It is located in the $HADOOP_HOME/bin folder.

hadoop@kunthi:~$ cd $HADOOP_HOME
hadoop@kunthi:~/hadoop$ hadoop version
Hadoop 2.6.4
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 5082c73637530b0b7e115f9625ed7fac69f937e6
Compiled by jenkins on 2016-02-12T09:45Z
Compiled with protoc 2.5.0
From source with checksum 8dee2286ecdbbbc930a6c87b65cbc010
This command was run using /opt/hadoop-2.6.4/share/hadoop/common/hadoop-common-2.6.4.jar

hdfs-site.xml

hadoop@kunthi:~/hadoop$ cp etc/hadoop/hdfs-site.xml etc/hadoop/hdfs-site.xml.20160820

I made the following changes:

<configuration>
<property>
<name>dfs.name.dir</name>
<value>/var/lib/hadoop/cache/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/var/lib/hadoop/cache/hadoop/dfs/data</value>
</property>
</configuration>

Formatting and starting the namenode

hadoop@kunthi:~/hadoop$ hadoop namenode -format
.....
16/08/20 09:15:09 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = gandhari/192.168.0.169
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.6.4
....
16/08/20 09:15:10 INFO common.Storage: Storage directory /var/lib/hadoop/cache/hadoop/dfs/name has been successfully formatted.
....
16/08/20 09:15:10 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at gandhari/192.168.0.169
************************************************************/
hadoop@kunthi:~/hadoop/sbin$ start-dfs.sh
hadoop@kunthi:~/hadoop/sbin$ start-yarn.sh
hadoop@kunthi:~/hadoop/sbin$ jps
6290 DataNode
6707 NodeManager
6599 ResourceManager
6459 SecondaryNameNode
6155 NameNode
7003 Jps
hadoop@kunthi:~/hadoop/sbin$ ./mr-jobhistory-daemon.sh start historyserver
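
Before relying on the web UIs, a small MapReduce job makes a handy end-to-end check; a sketch assuming the examples jar bundled with the 2.6.4 tarball:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar pi 2 5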

Access the job history server, name node and data node web UIs in your browser as shown below.

Job History: http://gandhari:19888/

hadoop004 - jobhistory

Name Node: http://gandhari:50070/

hadoop005 - namenode information

 

Data Node: http://gandhari:50075/

hadoop006 - datanode information

All Applications: http://gandhari:8088/cluster

hadoop007 - all applications