Lab 17 – Adding custom jars to MapReduce using Distributed Cache

Hi,

This is part of my blog post series on MapReduce. But let me tell you upfront: this doesn’t work. The job fails with a ClassNotFoundException. I’ll come back and update this post if I manage to fix it. As I have already spent too much time on this, I’ll use the -libjars parameter and GenericOptionsParser for such requirements.

Another post, Including Third-Party Libraries in my Map-Reduce Job (using distributed cache), also says this doesn’t work. I really don’t know how it worked for others. Let’s put this on hold.


Driver:

package org.grassfield.hadoop.dc;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * hadoop jar FeedCategoryCount-17.jar org.grassfield.hadoop.dc.DcJob /user/hadoop/lab17/input/feedList.csv /user/hadoop/lab17/05 /user/hadoop/lab17/dc/temp-0.0.1-SNAPSHOT.jar
 * @author pandian
 *
 */
public class DcJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
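        // Job.getInstance replaces the deprecated Job constructor
        Job job = Job.getInstance(getConf());
        Configuration conf = job.getConfiguration();
        job.setJarByClass(this.getClass());
        job.setJobName("DcJob");
        // args[2] is the HDFS path of the dependency jar
        Path path = new Path(args[2]);
        DistributedCache.addFileToClassPath(path, conf);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(DcMapper.class);
        // map-only job: mapper output is written directly
        job.setNumReduceTasks(0);
        // report a failed job to the caller instead of always returning 0
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // pass the job status on as the process exit code
        System.exit(ToolRunner.run(new Configuration(), new DcJob(), args));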
    }

}

Mapper:

package org.grassfield.hadoop.dc;

import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import temp.ToUpper;

/**
 * hadoop jar FeedCategoryCount-17.jar org.grassfield.hadoop.dc.DcJob /user/hadoop/lab17/input/feedList.csv /user/hadoop/lab17/05 /user/hadoop/lab17/dc/temp-0.0.1-SNAPSHOT.jar
 * @author pandian
 *
 */
public class DcMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {
    Path[] localCacheArchives;
    Path[] localCacheFiles;

    @Override
    protected void map(LongWritable key, Text value,
            Mapper<LongWritable, Text, Text, LongWritable>.Context context)
            throws IOException, InterruptedException {
        context.write(new Text(ToUpper.toUpper(value.toString())), key);
    }

    @Override
    protected void setup(
            Mapper<LongWritable, Text, Text, LongWritable>.Context context)
            throws IOException, InterruptedException {
        localCacheArchives = DistributedCache.getLocalCacheArchives(context.getConfiguration());
        localCacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    }
}
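
Since the failure is a class-loading problem, one way to narrow it down is to ask the JVM directly whether the class is visible. The following is only a sketch: ClasspathProbe is a hypothetical helper that is not part of this lab, and it assumes the task's context class loader is the one that should see the cached jar. It uses plain Java only, so it can be tried outside Hadoop first.

```java
// Hypothetical diagnostic helper; not part of the original lab code.
final class ClasspathProbe {
    private ClasspathProbe() {}

    /** Returns true if the named class can be found by the context class loader. */
    static boolean isLoadable(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // fall back to the loader that loaded this helper
            loader = ClasspathProbe.class.getClassLoader();
        }
        try {
            // 'false' = locate the class without running its static initializers
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** One-line report: class name, whether it is loadable, and the JVM classpath. */
    static String describe(String className) {
        return className + " loadable=" + isLoadable(className)
                + " java.class.path=" + System.getProperty("java.class.path");
    }
}
```

If setup() is reached before the job fails, calling System.err.println(ClasspathProbe.describe("temp.ToUpper")) there should show in the task's stderr log whether the jar made it onto the task classpath.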

I already have the dependency jar file in HDFS.

$ hadoop fs -ls /user/hadoop/lab17/dc
Found 1 items
-rw-r--r--   3 hadoop supergroup       4539 2016-10-08 07:36 /user/hadoop/lab17/dc/temp-0.0.1-SNAPSHOT.jar

The following is the content of the jar:

$ jar -tvf temp-0.0.1-SNAPSHOT.jar
     0 Fri Oct 07 07:24:08 MYT 2016 META-INF/
   133 Fri Oct 07 07:24:06 MYT 2016 META-INF/MANIFEST.MF
     0 Fri Oct 07 07:24:08 MYT 2016 temp/
  1276 Fri Oct 07 07:24:08 MYT 2016 temp/CharTest.class
  1750 Fri Oct 07 07:24:08 MYT 2016 temp/DateParser.class
   896 Fri Oct 07 07:24:08 MYT 2016 temp/Jdbc.class
   464 Fri Oct 07 07:24:08 MYT 2016 temp/ToUpper.class
     0 Fri Oct 07 07:24:08 MYT 2016 META-INF/maven/
     0 Fri Oct 07 07:24:08 MYT 2016 META-INF/maven/temp/
     0 Fri Oct 07 07:24:08 MYT 2016 META-INF/maven/temp/temp/
   927 Fri Oct 07 07:23:26 MYT 2016 META-INF/maven/temp/temp/pom.xml
   107 Fri Oct 07 07:24:08 MYT 2016 META-INF/maven/temp/temp/pom.properties
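
The source of temp.ToUpper is not shown in this post. For readers who want to reproduce the lab, here is a minimal stand-in, assuming the class does nothing more than upper-case its argument through a static toUpper(String) method, which is how the mapper calls it. The body is my guess, not the actual class from the jar.

```java
import java.util.Locale;

// Hypothetical stand-in for temp.ToUpper; the real implementation may differ.
// The real class lives in package "temp" and has to be public so that DcMapper
// can call it; both are left out here so the sketch compiles as a single file.
final class ToUpper {
    private ToUpper() {}

    /** Returns the input in upper case, using a fixed locale for repeatable results. */
    static String toUpper(String value) {
        return value.toUpperCase(Locale.ROOT);
    }
}
```

To build a jar like the one listed above, the class would go into temp/ToUpper.java with a package temp; line and the public modifier added.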

Let’s hope this will be resolved.

 
