Lab 03 – A Hadoop Mapper to get the category of an RSS feed with Eclipse and Maven

Dear Hadoopers,

I hope you found my earlier post on writing a simple mapper (https://javashine.wordpress.com/2016/09/11/lab-02-a-simple-hadoop-mapper-with-eclipse-and-maven/) interesting. Here is another Mapper program with a different task: we will read the category entries available in an RSS feed XML file.

Mapper

package org.grassfield.hadoop;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * A Mapper Program to count the categories in RSS XML file
 * This may not be the right approach to parse the XML.
 * Only for demo purpose
 * @author pandian
 *
 */
public class FeedCategoryCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final IntWritable one = new IntWritable(1);
    private final Text category = new Text();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Each call receives one line of the feed, e.g.
        // <category><![CDATA[Big Data]]></category>
        String line = value.toString().trim();
        if (line.contains("<category>")) {
            int start = line.indexOf("CDATA[");
            int end = line.indexOf("]]>");
            // Skip category lines that do not wrap their text in CDATA
            if (start >= 0 && end > start) {
                // Emit (category, 1), e.g. ("Big Data", 1)
                category.set(line.substring(start + 6, end).trim());
                context.write(category, one);
            }
        }
    }
}
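
If you want to try the string handling without a Hadoop cluster, here is a minimal sketch of the same extraction as plain Java. The class name CategoryExtractor and the method extract are my own names for illustration and are not part of the lab project. It assumes, like the mapper, that each category sits on a single line and is wrapped in CDATA.

```java
import java.util.Optional;

/** Standalone sketch of the mapper's string handling; hypothetical helper, not part of the lab project. */
final class CategoryExtractor {
    private static final String CDATA_OPEN = "<![CDATA[";
    private static final String CDATA_CLOSE = "]]>";

    /** Returns the CDATA text of a category line, or empty if the line does not match. */
    static Optional<String> extract(String line) {
        if (!line.contains("<category>")) {
            return Optional.empty();
        }
        int start = line.indexOf(CDATA_OPEN);
        int end = line.indexOf(CDATA_CLOSE, start + CDATA_OPEN.length());
        if (start < 0 || end < 0) {
            // A category line without CDATA is skipped rather than guessed at
            return Optional.empty();
        }
        return Optional.of(line.substring(start + CDATA_OPEN.length(), end).trim());
    }

    public static void main(String[] args) {
        System.out.println(extract("  <category><![CDATA[Big Data]]></category>")); // Optional[Big Data]
        System.out.println(extract("<title>Lab 03</title>"));                       // Optional.empty
        System.out.println(extract("<category>plain</category>"));                  // Optional.empty
    }
}
```

Returning an empty Optional for lines that do not match keeps the caller from writing an empty or garbled key, which is the same reason a mapper would skip such lines.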

Driver

Our driver is much the same as the one in our earlier program.

package org.grassfield.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * A Mapper Driver Program to count the categories in RSS XML file
 * This may not be the right approach to parse the XML.
 * Only for demo purpose
 * @author pandian
 *
 */
public class FeedCategoryCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        args = parser.getRemainingArgs();
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        // Job.getInstance replaces the deprecated Job constructor
        Job job = Job.getInstance(conf, "Feed Category Count");
        job.setJarByClass(getClass());
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setNumReduceTasks(0);
        FileInputFormat.setInputPaths(job, input);
        FileOutputFormat.setOutputPath(job, output);
        job.setMapperClass(FeedCategoryCountMapper.class);
        // Report a non-zero exit code if the job fails
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new FeedCategoryCountDriver(), args));
    }
}

Execution

Let’s run the Eclipse project as a Maven build with the clean install goals and SCP the jar to the Hadoop machine.

OK, we are ready to execute this.

hadoop@gandhari:~/jars$ hadoop jar FeedCategoryCount-0.0.1-SNAPSHOT.jar org.grassfield.hadoop.FeedCategoryCountDriver /user/hadoop/lab03/feed /user/hadoop/output/lab03_2

Let’s check the output folder, my friend.

hadoop@gandhari:~/jars$ hadoop fs -ls /user/hadoop/output/lab03_2
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2016-09-10 14:56 /user/hadoop/output/lab03_2/_SUCCESS
-rw-r--r--   3 hadoop supergroup        557 2016-09-10 14:56 /user/hadoop/output/lab03_2/part-m-00000

An output file of 557 bytes has been written. That’s interesting. Let’s cat it.

hadoop@gandhari:~/jars$ hadoop fs -cat /user/hadoop/output/lab03_2/part-m-00000
Big Data        1
YARN    1
Application Master      1
Resource Manager        1
Task Container  1
Big Data        1
Hadoop  1
Map Reduce      1
MapReduce       1
Job Tracker     1
Task Tracker    1
Big Data        1
HDFS    1
HDFS commands   1
HDFS permissions        1
Big Data        1
HDFS    1
Big Data        1
HDFS commands   1
Big Data        1
HDFS    1
Big Data        1
HDFS    1
HDFS block      1
HDFS replication        1
Big Data        1
HDFS    1
hadoop federation       1
hadoop high availability        1
hadoop rack-aware       1
ZooKeeperFailoverController     1
Big Data        1
HDFS    1
BDFS split      1
HDFS block      1
Big Data        1
Flume   1
Hadoop  1
HBase   1
Hive    1
Hue     1
Oozie   1
Pig     1
sqoop   1
ZooKeeper       1
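
Notice that a category such as Big Data shows up once per occurrence, each with a count of 1. The driver sets the number of reduce tasks to 0, so the mapper output is written as-is and nothing adds the counts up. As a rough sketch of what that summing step would produce, here is a plain Java version that totals lines in the key, tab, value layout that TextOutputFormat writes. The class name CategoryTotals and the sample lines are my own and only for illustration; this is not a Hadoop reducer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper that totals "category<TAB>count" lines; plain Java, not a Hadoop reducer. */
final class CategoryTotals {

    /** Sums the counts per category, keeping the categories in sorted order. */
    static Map<String, Integer> sum(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // not a key/value line
            }
            String category = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            totals.merge(category, count, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // A few lines in the same shape as part-m-00000
        List<String> sample = Arrays.asList(
                "Big Data\t1", "YARN\t1", "Big Data\t1", "HDFS\t1", "Big Data\t1");
        System.out.println(sum(sample)); // {Big Data=3, HDFS=1, YARN=1}
    }
}
```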
