Hadoop Pseudo-Distributed Mode – Setup – Ubuntu – old post. Do not use



Here is a summary of installing Hadoop in pseudo-distributed mode. This is my second post about the environment setup.

System Specs

  • OS: Ubuntu 32 bit/VirtualBox VM
  • RAM: 4 GB
  • CPU: 1
  • Java: 1.8
  • Hadoop: 2.6

Update Ubuntu

Let’s update Ubuntu before starting the process. This may take a long time, depending on how often you update.

The following command will update the package definitions.

pandian@kunthi:~$ sudo apt-get update
Fetched 1,646 kB in 8s (204 kB/s)
AppStream cache update completed, but some metadata was ignored due to errors.
Reading package lists... Done

The following command will update the packages

pandian@kunthi:~$ sudo apt-get dist-upgrade
355 upgraded, 5 newly installed, 0 to remove and 0 not upgraded.
Need to get 295 MB/465 MB of archives.
After this operation, 279 MB of additional disk space will be used.
Do you want to continue? [Y/n] Y

<It is time consuming. Take a break.>

Installing JDK

Following http://askubuntu.com/questions/521145/how-to-install-oracle-java-on-ubuntu-14-04, use the instructions below to install the JDK.

pandian@kunthi:~$ sudo apt-add-repository ppa:webupd8team/java
pandian@kunthi:~$ sudo apt-get update
pandian@kunthi:~$ sudo apt-get install oracle-java8-installer
pandian@kunthi:~$ java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) Client VM (build 25.101-b13, mixed mode)
pandian@kunthi:~$ whereis java
java: /usr/bin/java /usr/share/java /usr/share/man/man1/java.1.gz

Create User and User Group

Let’s run Hadoop with its own user and user group.

pandian@kunthi:~$ sudo groupadd -g 599 hadoop
pandian@kunthi:~$ sudo useradd -u 599 -g 599 hadoop

Directory structure

Let’s create the directory structure

pandian@kunthi:~$ sudo mkdir -p /opt/hadoop
pandian@kunthi:~$ sudo chown hadoop:hadoop -R /opt/hadoop
pandian@kunthi:~$ sudo mkdir -p /var/lib/hadoop/journaldata
pandian@kunthi:~$ sudo chown hadoop:hadoop -R /var/lib/hadoop/journaldata

User access and sudo privilege

We are still doing Linux tasks. We haven’t touched the Hadoop part yet.

pandian@kunthi:~$ sudo passwd hadoop
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
pandian@kunthi:~$ sudo usermod -d /opt/hadoop hadoop
pandian@kunthi:/opt/software/hadoop$ sudo su
root@kunthi:/home/pandian# cp /etc/sudoers /etc/sudoers.20160820
root@kunthi:~# vi /etc/sudoers

I added the hadoop line below, under the user privilege specification.

# User privilege specification
hadoop ALL=(ALL:ALL) ALL
root@kunthi:~# exit
pandian@kunthi:~$ su - hadoop
$ pwd
$ bash
hadoop@kunthi:~$ id
uid=1001(hadoop) gid=599(hadoop) groups=599(hadoop)

Hadoop package download

I copied the Hadoop download link from http://hadoop.apache.org/releases.html. Here is how to download it.

hadoop@kunthi:~$ wget http://download.nus.edu.sg/mirror/apache/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz

The downloaded file is saved in the hadoop directory.
hadoop@kunthi:~$ ls -alt
total 24
-rw-rw-r-- 1 hadoop hadoop 15339 Aug 20 07:43 hadoop-2.6.4.tar.gz
hadoop@kunthi:~$ gunzip hadoop-2.6.4.tar.gz
hadoop@kunthi:~$ tar -xvf hadoop-2.6.4.tar

This extracts the tar file into a new folder, /opt/hadoop/hadoop-2.6.4. Here is the content of the folder.
hadoop@kunthi:~$ ls -alt hadoop-2.6.4
total 60
drwxr-xr-x 3 hadoop hadoop 4096 Aug 20 07:53 ..
drwxr-xr-x 9 hadoop hadoop 4096 Feb 12 2016 .
drwxr-xr-x 2 hadoop hadoop 4096 Feb 12 2016 bin
drwxr-xr-x 3 hadoop hadoop 4096 Feb 12 2016 etc
drwxr-xr-x 2 hadoop hadoop 4096 Feb 12 2016 include
drwxr-xr-x 3 hadoop hadoop 4096 Feb 12 2016 lib
drwxr-xr-x 2 hadoop hadoop 4096 Feb 12 2016 libexec
-rw-r--r-- 1 hadoop hadoop 15429 Feb 12 2016 LICENSE.txt
-rw-r--r-- 1 hadoop hadoop 101 Feb 12 2016 NOTICE.txt
-rw-r--r-- 1 hadoop hadoop 1366 Feb 12 2016 README.txt
drwxr-xr-x 2 hadoop hadoop 4096 Feb 12 2016 sbin
drwxr-xr-x 4 hadoop hadoop 4096 Feb 12 2016 share

Let’s create the configuration directory for Hadoop.
hadoop@kunthi:~$ sudo mkdir -p /etc/hadoop/conf
Create a softlink for the conf folder
hadoop@kunthi:~$ sudo ln -s /opt/hadoop/hadoop-2.6.4/etc/hadoop/** /etc/hadoop/conf/
hadoop@kunthi:~$ ln -s hadoop-2.6.4 hadoop

SSH Keys creation.

Hadoop needs key-based SSH login.
hadoop@kunthi:~$ mkdir ~/.ssh
hadoop@kunthi:~$ cd ~/.ssh/
hadoop@kunthi:~/.ssh$ touch authorized_keys
hadoop@kunthi:~/.ssh$ touch known_hosts
hadoop@kunthi:~/.ssh$ chmod 700 ~/.ssh/ && chmod 600 ~/.ssh/*
hadoop@kunthi:~/.ssh$ ssh localhost
The authenticity of host 'localhost (' can't be established.
ECDSA key fingerprint is SHA256:Fj6op9qzbfodhsQTmpQJ17G/mcAvu541bTMTb3huhPg.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
hadoop@localhost's password:
Welcome to Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-31-generic i686)

Bash profile – Environmental variables

Since I created the home folder of this unix user manually, I need to create the bash profile myself. I’ll copy the bash profile files from another user, for whom they already work.
hadoop@kunthi:~$ cp /home/pandian/.bash* ~/
I’ll add the following environment variables to .bashrc
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
export HADOOP_HOME=/opt/hadoop/hadoop

Let’s apply the changes to current session
hadoop@kunthi:~$ source ~/.bashrc
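
Later in this post, hadoop version and start-dfs.sh are run without a path prefix, which suggests that the Hadoop bin and sbin folders were on the PATH. The post does not show that line, so the PATH export below is an assumption. Here is a minimal sketch of the .bashrc fragment under that assumption:

```shell
# Values taken from the post above.
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
export HADOOP_HOME=/opt/hadoop/hadoop

# Assumption (not shown in the post): put the Hadoop commands and the
# start/stop scripts on the PATH so they can be run from any folder.
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
```

After sourcing the file, `command -v hadoop` should print a path under /opt/hadoop/hadoop/bin if the folder layout matches the one used here.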

Hadoop env config

Let’s specify JAVA_HOME
hadoop@kunthi:~/hadoop/etc/hadoop$ cd $HADOOP_HOME/etc/hadoop/
hadoop@kunthi:~/hadoop/etc/hadoop$ cp hadoop-env.sh hadoop-env.sh.20160820

I made the following changes to hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-oracle/

Setup passwordless ssh login

hadoop@kunthi:~/hadoop$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/opt/hadoop/.ssh/id_rsa):
Your identification has been saved in /opt/hadoop/.ssh/id_rsa.
Your public key has been saved in /opt/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:UXGO3tnfK9K8DayD0/jc+T/WgZetCHOuBAcssUw3gBo hadoop@kunthi
The key's randomart image is:
+---[RSA 2048]----+
| .+.o o.. |
| E .o = o + |
| o + + . . |
| . . + . o |
| S o o o o|
| oo o. =o|
| ==o+..=|
| =.+=+=oo|
| +=o+=++|
hadoop@kunthi:~/hadoop$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
hadoop@kunthi:~/hadoop$ sudo /etc/init.d/ssh restart
[ ok ] Restarting ssh (via systemctl): ssh.service.
hadoop@kunthi:~/hadoop$ ssh hadoop@kunthi
Welcome to Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-31-generic i686)

Define the slave name

Add the slave hostname to the slaves file. After the change, the file holds the slave name, which matches my hostname.
hadoop@kunthi:~/hadoop$ cat /etc/hadoop/conf/slaves


Make the appropriate changes to core-site.xml
hadoop@kunthi:~/hadoop$ cat etc/hadoop/core-site.xml


hadoop@kunthi:~$ cd $HADOOP_HOME
hadoop@kunthi:~/hadoop$ hadoop version
Hadoop 2.6.4
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 5082c73637530b0b7e115f9625ed7fac69f937e6
Compiled by jenkins on 2016-02-12T09:45Z
Compiled with protoc 2.5.0
From source with checksum 8dee2286ecdbbbc930a6c87b65cbc010
This command was run using /opt/hadoop/hadoop-2.6.4/share/hadoop/common/hadoop-common-2.6.4.jar
hadoop@kunthi:~/hadoop$ sudo mkdir -p /var/lib/hadoop/cache/hadoop/dfs/name
hadoop@kunthi:~/hadoop$ sudo chown hadoop:hadoop /var/lib/hadoop/cache/hadoop/dfs/name
hadoop@kunthi:~/hadoop$ sudo mkdir -p /var/lib/hadoop/cache/hadoop/dfs/data
hadoop@kunthi:~/hadoop$ sudo chown hadoop:hadoop /var/lib/hadoop/cache/hadoop/dfs/data
hadoop@kunthi:~/hadoop$ cp etc/hadoop/hdfs-site.xml etc/hadoop/hdfs-site.xml.20160820

I made the following changes to hdfs-site.xml
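
The listing of the changes is not preserved here. As a rough sketch only, a pseudo-distributed hdfs-site.xml commonly sets a replication factor of 1 and points the name and data directories at the folders created above. The replication value and the sample file name are assumptions, not the author's actual settings.

```shell
# Write a sample hdfs-site.xml to a scratch file (hypothetical name).
# Review it before copying anything into etc/hadoop/hdfs-site.xml.
sample="${TMPDIR:-/tmp}/hdfs-site.xml.sample"

cat > "$sample" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: a single node, so keep one replica of each block. -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- Directories created earlier in this post. -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///var/lib/hadoop/cache/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///var/lib/hadoop/cache/hadoop/dfs/data</value>
  </property>
</configuration>
EOF

echo "wrote $sample"
```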

Formatting and starting the namenode

hadoop@kunthi:~/hadoop$ hadoop namenode -format
16/08/20 09:15:09 INFO namenode.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = kunthi/
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.6.4
16/08/20 09:15:10 INFO common.Storage: Storage directory /var/lib/hadoop/cache/hadoop/dfs/name has been successfully formatted.
16/08/20 09:15:10 INFO namenode.NameNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at kunthi/
hadoop@kunthi:~/hadoop/sbin$ sudo mkdir /logs
hadoop@kunthi:~/hadoop/sbin$ sudo chown hadoop:hadoop /logs/
hadoop@kunthi:~/hadoop/sbin$ start-dfs.sh
hadoop@kunthi:~/hadoop/sbin$ start-yarn.sh
hadoop@kunthi:~/hadoop/sbin$ jps
6290 DataNode
6707 NodeManager
6599 ResourceManager
6459 SecondaryNameNode
6155 NameNode
7003 Jps
hadoop@kunthi:~/hadoop/sbin$ ./mr-jobhistory-daemon.sh start historyserver

Access the job history server, name node and data node pages using your browser.

[Screenshots: hadoop001 - jobhistory; hadoop002 - namenode information; hadoop003 - datanode information]
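
For reference, the web pages normally live on the stock Hadoop 2.x ports. The sketch below only prints the addresses; it assumes the default ports were not overridden in the site XML files, and hadoop_web_urls is a helper name made up for this example.

```shell
# Print the default Hadoop 2.x web UI addresses for a host.
# Assumption: the stock ports are in use (nothing overridden in *-site.xml).
hadoop_web_urls() {
    host="${1:-localhost}"
    printf 'namenode         http://%s:50070/\n' "$host"
    printf 'datanode         http://%s:50075/\n' "$host"
    printf 'resourcemanager  http://%s:8088/\n' "$host"
    printf 'jobhistory       http://%s:19888/\n' "$host"
}

hadoop_web_urls localhost
```

Open any of the printed addresses in the browser once the daemons listed by jps are running.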



Hadoop process flow

We divide the Hadoop process flow into two major categories:

Storage and Jobs

Storage – Building blocks

We have an HDFS cluster. The cluster has a number of data nodes: node 1, node 2, ... node n. These are slave nodes, on which the data is stored in small chunks. They are administered by the admin node through the name node.

When a huge amount of data comes in, the admin node controls how many nodes are available, where the data can be stored, and so on. The name node holds the catalog of the data already saved on the slave nodes.
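
To get a feel for the "small chunks", here is a minimal sketch that works out how many blocks a file of a given size occupies. It assumes a block size of 128 MB, which is the Hadoop 2.x default; hdfs_block_count is a helper name made up for this example.

```shell
# Number of HDFS blocks needed for a file: ceil(file_size / block_size).
hdfs_block_count() {
    file_bytes=$1
    block_bytes=${2:-134217728}   # assumption: 128 MB, the Hadoop 2.x default
    echo $(( (file_bytes + block_bytes - 1) / block_bytes ))
}

hdfs_block_count 1073741824   # a 1 GiB file   -> 8
hdfs_block_count 314572800    # a 300 MiB file -> 3
```

Each of those blocks is then copied to as many data nodes as the configured replication factor says, and the name node records where every copy lives.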

Jobs – Building blocks

We start with the MapReduce engine. It provides a set of advanced APIs to retrieve the data. MapReduce jobs are scheduled by the Job Tracker, which is the admin of jobs. It relies on Task Trackers running on the individual data nodes to complete a job. Generally, a data node holds both the data and a task tracker.

The Job Tracker tracks job completion. If a task fails on one node, it finds an alternate node to get the task done.
I have depicted it pictorially below.

Hadoop storage and jobs – Javashine


HDFS & MapReduce – Set of concepts

After writing about the ecosystem of Hadoop, I should write about wiring those blocks together to see them working. Before doing that, I’d like to document the HDFS/MR paradigm quickly.

If we look at Hadoop at a high level, we can separate it into 2 parts.

1. HDFS
2. Map/Reduce

Nodes in a Hadoop cluster store the data in HDFS. It stores a huge volume of data as many small blocks. HDFS runs on top of the unix filesystem (or whichever filesystem the host provides).

Searching for the data across multiple nodes based on the catalog, and aggregating it to arrive at the desired result, is called MapReduce processing.
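
The idea can be tried on a single machine with ordinary unix tools. The sketch below is only an analogy, not Hadoop code: it counts words using a map step (split into records), a shuffle step (sort, so equal keys sit together) and a reduce step (count each group). word_count is a name made up for this example.

```shell
# Word count as a local map / shuffle / reduce pipeline.
word_count() {
    tr -s '[:space:]' '\n' |        # map: emit one word per line
        grep -v '^$' |              # drop empty records
        sort |                      # shuffle: bring identical words together
        uniq -c |                   # reduce: count each group of words
        awk '{ print $2 "\t" $1 }'  # format as word<TAB>count
}

printf 'big data big cluster\ndata node\n' | word_count
```

In Hadoop the same three steps run in parallel on the nodes that hold the blocks, instead of in one pipeline on one machine.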

I have depicted it diagrammatically below.

HDFS, MapReduce paradigm – Javashine


Hadoop EcoSystem

Here comes another important piece of theory after the 5 Vs: yet another interesting concept in the Big Data paradigm.

Inserting your data:

Sqoop/Flume – These tools are responsible for inserting data into the file system from various sources.


HDFS – The Hadoop Distributed File System, which stores the huge volume of data as small blocks across multiple nodes or servers.

HBase – This complements HDFS where HDFS has limitations. It offers streaming or real-time updates.

Map Reduce / YARN – This is the set of APIs to collate the data and process it to arrive at the desired result.

HCatalog – This is the ‘Directory’ service for HDFS. It helps in accessing the data on the data nodes and in standardizing data access.

Hive/Pig – Analytics tools with Scripting


Oozie – This is used to create workflows.

Ambari – This is used to wire the different components of the Hadoop ecosystem together into a coherent operation.

Let’s talk about each one of them in detail later, if possible!

5Vs of Big data

Big data is entering the market at full speed. When I started looking at it some years ago, it had begun to show its ability to handle enterprise data with the introduction of YARN and other ecosystem products. We had a meeting with one of our existing customers, who runs a typical ERP and MS products. When they said they had finished exploring Big Data and decided to implement it, it showed that Big Data would soon become a generic skill set.

I was amazed to realize how Hadoop handles the 5 Vs of Big Data. Before discussing how Hadoop manages those Vs, let’s look at what they are.

1. Volume – We are talking about data that is huge in size: TBs or ZBs. Implementing such a complex system may be too much effort for a small-scale enterprise.

2. Variety – RDBMSs exist to handle structured data. Here we are talking about a variety of data from different sources in different formats. It may be XML from RSS feeds, XLS files, CSV files from real-time market data, and so on.

3. Velocity – This is the speed at which we do the data analytics. As a simple example, think of a data analytics engine that processes real-time market data at high speed.

4. Veracity/Verification – Running into bad data during an ETL process is common. The data may not arrive from the expected source at the expected time, or the data received may not meet our constraints. In my earlier days of ETL, I used to add many conditions on the staging tables so that my aggregation processing would run as expected. Later I realized that I was dropping a lot of data during ETL, because rows failed my constraints. My coworker, an Oracle expert, used to advise inserting as much data as possible into the ETL tables; if the aggregation process fails, we can fine-tune it so that we don’t miss any data. Let’s see how Big Data handles this.

5. Value – Okay, we have data. But how does it make sense to me or my customer? I may have 100 TB of unused data taking up my DC space, and yet have no room to accommodate 10 TB of business-critical data. How am I going to handle these situations?

Let’s discuss these in the future under the 5vs tag.

I’ll meet you in another post.

Happy Independence Day

localhost: Error: JAVA_HOME is not set and could not be found.

I got this error while starting DFS in a new hadoop standalone deployment.

localhost: Error: JAVA_HOME is not set and could not be found.

However, JAVA_HOME is set in etc/hadoop/hadoop-env.sh.


JAVA_HOME is also set properly in the environment variables.

$ echo $JAVA_HOME

Hardcoding the java path in etc/hadoop/hadoop-env.sh solved this issue.

# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-7-oracle/
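
If you are not sure which path to hardcode, it can be worked out from the java binary on the PATH. This is a minimal sketch, assuming GNU readlink (standard on Ubuntu); java_home_from_binary is a helper name made up for this example, and for a JDK whose binary sits under jre/bin it returns the jre folder.

```shell
# Derive a JAVA_HOME candidate from the full path of a java binary
# by stripping the trailing /bin/java.
java_home_from_binary() {
    dirname "$(dirname "$1")"
}

# Resolve the symlinks behind /usr/bin/java, if java is installed.
if command -v java >/dev/null 2>&1; then
    java_home_from_binary "$(readlink -f "$(command -v java)")"
fi
```

The printed folder is the value to put after export JAVA_HOME= in etc/hadoop/hadoop-env.sh.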

Hadoop: hadoop-2.6.4

OS: Ubuntu 16.04 LTS



Setting up Eclipse IDE for Hadoop 2.5.0


Read the following blog posts first to understand this post better.

  1. Copying the File to HDFS file system
  2. A java program to read the file from HDFS
  3. A java program to read the file from HDFS – 2
  4. Java program to read a file from Hadoop Cluster 2 (with file seek)
  5. Java program to copy a file from local disk to Hadoop Cluster with progress bar

Until now we have been struggling with the terminal to write the Java programs. Here is how you can set up the Eclipse development environment for Hadoop.

  • This tutorial assumes you have a working Hadoop 2.5.0 setup in your environment.
  • This tutorial assumes you have the m2-eclipse Maven plugin.
  • This tutorial assumes you have the latest version of Maven installed on your system.

Set up a plain Java Maven project. I named mine my-app.

Cloud era repository is not still available…
