Lab 02 – A Simple Hadoop Mapper with Eclipse and Maven

Hi Hadoopers,

Development happens in Eclipse on the desktop; the Hadoop commands below are run on the Hadoop server. I assume you have downloaded the Eclipse IDE for your platform. (I use STS for this demo, as it ships with the Maven plugin out of the box.)

Create a new Java project.


After creating the Java project, convert it into a Maven project.


After adding the Maven capability, add the following Hadoop 2.6.4 dependencies to pom.xml. Strictly speaking, hadoop-client already pulls in hadoop-common and hadoop-hdfs transitively, but listing them explicitly does no harm.

<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.6.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.6.4</version>
    </dependency>
</dependencies>

Mapper


Let’s create a mapper to count the words in a file. Refer to https://hadoop.apache.org/docs/r2.7.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0 for more details.

package org.grassfield.hadoop;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @author Pandian
 *
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    /**
     * This is to store the output string
     */
    private Text word = new Text();
    
    
    /**
     * This is to denote each occurrence of the word
     */
    private IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        //each map() call receives one line of the input file
        String line = value.toString();
        
        //tokenize by comma
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            //store the token as the word
            word.set(st.nextToken());
            
            //register the count
            context.write(word, one);
        }
    }

}
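If you want to verify the mapper logic before touching the cluster, a unit test is handy. The sketch below is my own addition, not part of this lab: it uses MRUnit, which would need the org.apache.mrunit:mrunit test dependency (hadoop2 classifier) plus JUnit in the pom.

package org.grassfield.hadoop;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

    @Test
    public void mapperEmitsOnePerToken() throws Exception {
        //feed one comma-separated line and expect one (word, 1) pair per token
        MapDriver.newMapDriver(new WordCountMapper())
                .withInput(new LongWritable(0), new Text("thai,panguni"))
                .withOutput(new Text("thai"), new IntWritable(1))
                .withOutput(new Text("panguni"), new IntWritable(1))
                .runTest();
    }
}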

Driver

The Mapper class cannot be executed by itself; hence we write a driver.

package org.grassfield.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;

/**
 * @author pandian
 *
 */
public class WordCountDriver extends Configured implements Tool {
    public WordCountDriver(Configuration conf){
        //Assign the configuration to the super class.
        //Otherwise getConf() will return null
        super(conf);
    }

    @Override
    public int run(String[] args) throws Exception {
        //get the configuration
        Configuration conf = getConf();

        //initiate the parser with arguments and configuration
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        args = parser.getRemainingArgs();
        
        //input and output HDFS locations are received as command line argument
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        
        //Mapper Job is defined. Job.getInstance replaces
        //the Job constructor, which is deprecated in Hadoop 2.x
        Job job = Job.getInstance(conf, "Word Count Driver");
        job.setJarByClass(getClass());
        
        //Output format is defined
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        
        //reducer count is defined. We do not have any reducers for this assignment
        job.setNumReduceTasks(0);
        
        //File input and output formats are specified
        FileInputFormat.setInputPaths(job,  input);
        FileOutputFormat.setOutputPath(job, output);
        
        //Set the mapper class and run the job,
        //propagating success or failure as the exit code
        job.setMapperClass(WordCountMapper.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }
    
    /**
     * @param args input and output HDFS locations
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        WordCountDriver driver = new WordCountDriver(conf);
        System.exit(driver.run(args));
    }

}
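Since the driver extends Configured and implements Tool, an alternative is to hand the bootstrapping to Hadoop's ToolRunner, which applies the generic options before invoking run(). A minimal sketch; the launcher class name is my own, not part of the lab:

package org.grassfield.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class WordCountLauncher {
    public static void main(String[] args) throws Exception {
        //ToolRunner sets the configuration on the Tool and strips the
        //generic options (-D, -files, ...) before calling run()
        int exitCode = ToolRunner.run(new WordCountDriver(new Configuration()), args);
        System.exit(exitCode);
    }
}

With ToolRunner in charge, the GenericOptionsParser call inside run() becomes redundant, though it does no harm.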

Export

Let’s execute Maven with the clean install goals. The jar will be installed to the location shown below:

[INFO] --- maven-install-plugin:2.4:install (default-install) @ WordCount ---
[INFO] Installing D:\workspace_gandhari\WordCount\target\WordCount-0.0.1-SNAPSHOT.jar to C:\Users\pandian\.m2\repository\WordCount\WordCount\0.0.1-SNAPSHOT\WordCount-0.0.1-SNAPSHOT.jar

I copy the jar file to the Hadoop server as the hadoop user.
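One way to do that, assuming SSH access to the server (the ~/jars directory matches the session below):

scp target/WordCount-0.0.1-SNAPSHOT.jar hadoop@gandhari:~/jars/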

Input file

To execute the mapper, we need a file in HDFS. I have the following file already.

hadoop@gandhari:~/jars$ hadoop fs -cat /user/hadoop/lab01/month.txt
chithirai
vaigasi
aani
aadi
aavani
purattasi
aippasi
karthikai
margazhi
thai
thai
panguni

hadoop@gandhari:~/jars$ hadoop jar WordCount-0.0.1-SNAPSHOT.jar org.grassfield.hadoop.WordCountDriver /user/hadoop/lab01/month.txt /user/hadoop/lab01/output/10
...
16/09/09 23:32:22 INFO mapreduce.JobSubmitter: number of splits:1
16/09/09 23:32:23 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/09/09 23:32:23 INFO mapreduce.Job: Running job: job_local57180184_0001
16/09/09 23:32:23 INFO mapred.LocalJobRunner: Starting task: attempt_local57180184_0001_m_000000_0
16/09/09 23:32:23 INFO mapred.MapTask: Processing split: hdfs://gandhari:9000/user/hadoop/lab01/month.txt:0+90
16/09/09 23:32:23 INFO output.FileOutputCommitter: Saved output of task 'attempt_local57180184_0001_m_000000_0' to hdfs://gandhari:9000/user/hadoop/lab01/output/10/_temporary/0/task_local57180184_0001_m_000000
16/09/09 23:32:24 INFO mapreduce.Job:  map 100% reduce 0%
16/09/09 23:32:24 INFO mapreduce.Job: Job job_local57180184_0001 completed successfully

Output file

Let’s see if our output files were created:

hadoop@gandhari:~/jars$ hadoop fs -ls /user/hadoop/lab01/output/10
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2016-09-09 23:32 /user/hadoop/lab01/output/10/_SUCCESS
-rw-r--r--   3 hadoop supergroup        114 2016-09-09 23:32 /user/hadoop/lab01/output/10/part-m-00000

The part-m prefix denotes map output (a reducer’s output would be part-r). Since we set the reducer count to zero, the records are written exactly as the mapper emitted them, with no aggregation; that is why thai appears twice below with a count of 1 each. Here is the output of our job.

hadoop@gandhari:~/jars$ hadoop fs -cat /user/hadoop/lab01/output/10/part-m-00000
chithirai       1
vaigasi 1
aani    1
aadi    1
aavani  1
purattasi       1
aippasi 1
karthikai       1
margazhi        1
thai    1
thai    1
panguni 1

Interesting, isn’t it?

Have a good week.
