Hadoop Ecosystem Installation – Contents

Here is the list of pages that can help you install Hadoop and its ecosystem products:

Distributed HBase & ZooKeeper Installation and Configuration

Hue Installation and Configuration

Distributed HBase & ZooKeeper Installation and Configuration

Hi,

I’m happy to share the output of an interesting lab exercise with you. In this post, let’s install HBase and ZooKeeper and issue some commands in the HBase shell.

HBase is an open-source, non-relational, distributed database modeled after Google’s BigTable and written in Java. It is developed as part of the Apache Software Foundation’s Apache Hadoop project and runs on top of HDFS (the Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop.

Apache ZooKeeper is a software project of the Apache Software Foundation. It is essentially a distributed hierarchical key-value store, which is used to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems. ZooKeeper was a sub-project of Hadoop but is now a top-level project in its own right.
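
As a quick illustration of that hierarchical key-value model, once ZooKeeper is installed and started (later in this post) you can create and read a znode from the command-line client; the /demo path and its value are just examples:

hadoop@gandhari:~$ zookeeper/bin/zkCli.sh -server gandhari:2181
[zk: gandhari:2181(CONNECTED) 0] create /demo "hello"
[zk: gandhari:2181(CONNECTED) 1] get /demo
[zk: gandhari:2181(CONNECTED) 2] ls /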

HBase Download and Configuration

hadoop@gandhari:/opt/hadoop-2.6.4$ wget https://archive.cloudera.com/cdh5/cdh/5/hbase-1.0.0-cdh5.5.1.tar.gz

hadoop@gandhari:/opt/hadoop-2.6.4$ gunzip hbase-1.0.0-cdh5.5.1.tar.gz

hadoop@gandhari:/opt/hadoop-2.6.4$ tar -xvf hbase-1.0.0-cdh5.5.1.tar

hadoop@gandhari:/opt/hadoop-2.6.4$ ln -s hbase-1.0.0-cdh5.5.1/ hbase

hadoop@gandhari:/opt/hadoop-2.6.4$ mkdir /tmp/hbase

hadoop@gandhari:/opt/hadoop-2.6.4$ cd hbase/conf

hadoop@gandhari:/opt/hadoop-2.6.4/hbase/conf$ vi hbase-env.sh

Add the following entries to hbase-env.sh. HBASE_MANAGES_ZK=false tells HBase not to manage its own ZooKeeper, since we will run ZooKeeper separately.

export HBASE_MANAGES_ZK=false
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
export HADOOP_HOME=/opt/hadoop

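If you are unsure of the exact JDK location on your machine, one way to confirm it before setting JAVA_HOME (the java-8-oracle path above is this lab's setup):

hadoop@gandhari:/opt/hadoop-2.6.4/hbase/conf$ readlink -f $(which java)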

hadoop@gandhari:/opt/hadoop-2.6.4/hbase/conf$ vi ~/.bashrc

#HBASE VARIABLES
export HBASE_HOME=/opt/hadoop/hbase
export PATH=$PATH:$HBASE_HOME/bin


hadoop@gandhari:/opt/hadoop-2.6.4/hbase/conf$ source ~/.bashrc
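
A quick way to confirm the PATH update took effect is to print the HBase version banner:

hadoop@gandhari:/opt/hadoop-2.6.4/hbase/conf$ hbase version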

ZooKeeper – Download and Configuration


hadoop@gandhari:/opt/hadoop-2.6.4/hbase/conf$ cd $HOME
hadoop@gandhari:~$ pwd
/opt/hadoop

hadoop@gandhari:~$ wget https://archive.cloudera.com/cdh5/cdh/5/zookeeper-3.4.5-cdh5.5.1.tar.gz

hadoop@gandhari:~$ gunzip zookeeper-3.4.5-cdh5.5.1.tar.gz

hadoop@gandhari:~$ tar -xvf zookeeper-3.4.5-cdh5.5.1.tar

hadoop@gandhari:~$ ln -s zookeeper-3.4.5-cdh5.5.1/ zookeeper

hadoop@gandhari:~$ cd zookeeper
hadoop@gandhari:~/zookeeper$ mkdir zookeeper

hadoop@gandhari:~/zookeeper$ cd conf

hadoop@gandhari:~/zookeeper/conf$ cp zoo_sample.cfg zoo.cfg

hadoop@gandhari:~/zookeeper/conf$ vi zoo.cfg

Add the following entries to zoo.cfg. Use the absolute path for dataDir, because ZooKeeper does not expand shell variables such as $HOME in its configuration:

dataDir=/opt/hadoop/zookeeper/zookeeper
server.0=gandhari:2888:3888

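For reference, the resulting zoo.cfg for this single-node quorum looks roughly like this (tickTime, initLimit, syncLimit and clientPort are the defaults carried over from zoo_sample.cfg):

tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181
dataDir=/opt/hadoop/zookeeper/zookeeper
server.0=gandhari:2888:3888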

hadoop@gandhari:~/zookeeper/conf$ cp zoo.cfg /opt/hadoop/hbase/conf/

This copies the ZooKeeper configuration over to HBase.

Create a myid file in ZooKeeper's dataDir folder containing the entry 0, which denotes the server instance number (matching server.0 in zoo.cfg).

hadoop@gandhari:~/zookeeper$ touch myid

hadoop@gandhari:~/zookeeper$ echo '0' > /opt/hadoop/zookeeper/zookeeper/myid

hadoop@gandhari:/etc/hadoop/conf$ cd /opt/hadoop/etc/hadoop/

hadoop@gandhari:~/etc/hadoop$ cp core-site.xml /opt/hadoop/hbase/conf/
hadoop@gandhari:~/etc/hadoop$ cp hdfs-site.xml /opt/hadoop/hbase/conf/
hadoop@gandhari:~/etc/hadoop$ cp yarn-site.xml /opt/hadoop/hbase/conf/
hadoop@gandhari:~/etc/hadoop$ cp mapred-site.xml.template /opt/hadoop/hbase/conf/mapred-site.xml
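
You can confirm that the Hadoop client configuration is now visible to HBase:

hadoop@gandhari:~/etc/hadoop$ ls /opt/hadoop/hbase/conf/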

Reconfigure HBase with ZooKeeper

hadoop@gandhari:~/etc/hadoop$ cd $HOME
hadoop@gandhari:~$ cd hbase/conf/

hadoop@gandhari:~/hbase/conf$ vi hbase-site.xml

<configuration>
 <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed Zookeeper
      true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>
 <property>
    <name>hbase.regionserver.hlog.replication</name>
    <value>1</value>
  </property>
 <property>
    <name>hbase.tmp.dir</name>
    <value>/tmp/hbase</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://gandhari:9000/hbase</value>
  </property>
 <property>
  <name>hbase.zookeeper.quorum</name>
  <value>gandhari:2181</value>
 </property>
 <property>
  <name>zookeeper.session.timeout</name>
  <value>15000</value>
 </property>
 <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/hadoop/zookeeper/zookeeper</value>
    <description>Property from ZooKeeper config zoo.cfg.
    The directory where the snapshot is stored.
    </description>
  </property>
</configuration>
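
Note that hbase.rootdir must point at the same NameNode address as fs.defaultFS in core-site.xml (hdfs://gandhari:9000 here). A quick sanity check (older configs may use fs.default.name instead):

hadoop@gandhari:~/hbase/conf$ grep -A1 'fs.defaultFS' /opt/hadoop/etc/hadoop/core-site.xml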

Let's start ZooKeeper

hadoop@gandhari:~/hbase/conf$ cd /opt/hadoop/zookeeper/bin

hadoop@gandhari:~/zookeeper/bin$ ./zkServer.sh start
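
Before moving on, you can verify that ZooKeeper is up; the four-letter 'ruok' command should return 'imok' (nc must be installed for the second check):

hadoop@gandhari:~/zookeeper/bin$ ./zkServer.sh status
hadoop@gandhari:~/zookeeper/bin$ echo ruok | nc localhost 2181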

Now let's start HBase

hadoop@gandhari:~/zookeeper/bin$ cd /opt/hadoop/hbase/bin/

hadoop@gandhari:~/hbase/bin$ ./hbase-daemon.sh start master

hadoop@gandhari:~/hbase/bin$ ./hbase-daemon.sh start regionserver
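
If either daemon fails to come up, the logs are the first place to look. HBase names them hbase-<user>-<daemon>-<host>.log, so on this box the files would be (names assumed from that convention):

hadoop@gandhari:~/hbase/bin$ tail -n 50 ../logs/hbase-hadoop-master-gandhari.log
hadoop@gandhari:~/hbase/bin$ tail -n 50 ../logs/hbase-hadoop-regionserver-gandhari.log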

Let's start the HBase shell

hadoop@gandhari:~/hbase/bin$ ./hbase shell

hbase(main):001:0> status

ERROR: Can't get master address from ZooKeeper; znode data == null

This error indicates that the Hadoop daemons are not running. Make sure you have started them before launching HBase.
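
If they are down, start them first using the standard Hadoop 2.x scripts (assuming $HADOOP_HOME/sbin is on your PATH):

hadoop@gandhari:~$ start-dfs.sh
hadoop@gandhari:~$ start-yarn.sh
hadoop@gandhari:~$ mr-jobhistory-daemon.sh start historyserver

Then verify with jps: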

$jps

10693 NodeManager
 10229 DataNode
 10086 NameNode
 11254 HRegionServer
 10936 JobHistoryServer
 10569 ResourceManager
 11131 HMaster
 10411 SecondaryNameNode
 11356 Jps
 11070 QuorumPeerMain

hbase(main):001:0> status
 1 servers, 0 dead, 2.0000 average load

hbase(main):002:0> status 'simple'
 1 live servers
 gandhari:60020 1472360461080
 requestsPerSecond=0.0, numberOfOnlineRegions=2, usedHeapMB=26, maxHeapMB=1958, numberOfStores=2, numberOfStorefiles=0, storefileUncompressedSizeMB=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0, readRequestsCount=8, writeRequestsCount=5, rootIndexSizeKB=0, totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0, currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
 0 dead servers
 Aggregate load: 0, regions: 2

hbase(main):003:0> status 'summary'
 1 servers, 0 dead, 2.0000 average load

hbase(main):005:0> status 'detailed'
 version 1.0.0-cdh5.5.1
 0 regionsInTransition
 master coprocessors: []
 1 live servers
 gandhari:60020 1472360461080
 requestsPerSecond=0.0, numberOfOnlineRegions=2, usedHeapMB=26, maxHeapMB=1958, numberOfStores=2, numberOfStorefiles=0, storefileUncompressedSizeMB=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0, readRequestsCount=8, writeRequestsCount=5, rootIndexSizeKB=0, totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0, currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
 "hbase:meta,,1"
 numberOfStores=1, numberOfStorefiles=0, storefileUncompressedSizeMB=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0, readRequestsCount=2, writeRequestsCount=3, rootIndexSizeKB=0, totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0, currentCompactedKVs=0, compactionProgressPct=NaN, completeSequenceId=-1, dataLocality=0.0
 "hbase:namespace,,1472360489768.21310113a36cdc875d33fdac0b6060fd."
 numberOfStores=1, numberOfStorefiles=0, storefileUncompressedSizeMB=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0, readRequestsCount=6, writeRequestsCount=2, rootIndexSizeKB=0, totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0, currentCompactedKVs=0, compactionProgressPct=NaN, completeSequenceId=-1, dataLocality=0.0
 0 dead servers

hbase(main):006:0> list
 TABLE
 0 row(s) in 0.0400 seconds

=> []
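
To exercise the shell a little further, here is a minimal round trip; the table name 'test' and column family 'cf' are just examples:

hbase(main):007:0> create 'test', 'cf'
hbase(main):008:0> put 'test', 'row1', 'cf:greeting', 'hello'
hbase(main):009:0> scan 'test'
hbase(main):010:0> get 'test', 'row1'
hbase(main):011:0> disable 'test'
hbase(main):012:0> drop 'test'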

Hadoop Ecosystem

After the 5 Vs, here comes another important and interesting concept of the Big Data paradigm: the Hadoop ecosystem and the roles its components play.

Ingesting your data:

Sqoop/Flume – These tools are responsible for ingesting data into the file system from various sources (Sqoop from relational databases, Flume from streaming sources such as log data).

HDFS:

HDFS – The Hadoop Distributed File System, which stores huge volumes of data as blocks spread across multiple nodes or servers.

HBase – This complements HDFS where HDFS has handicaps: it offers random, real-time reads and writes on top of HDFS.

MapReduce / YARN – This is the set of APIs used to collate and process the data to arrive at the desired result, with YARN managing the cluster resources.

HCatalog – The table and storage management ('directory') service for HDFS. It makes it easier to access data on the data nodes and standardizes data access across tools.

Hive/Pig – Analytics tools with scripting support (SQL-like HiveQL and Pig Latin, respectively).

Wiring:

Oozie – Used to create workflows that chain jobs together.

Ambari – Used to wire together, manage and monitor the different components of the Hadoop ecosystem so that they form a coherent operation.

Let’s talk about each one of them in detail later, if possible!