Automation & Analysis of RSS & ATOM newsfeeds using Hadoop Ecosystem

This project extracts, analyses and displays RSS and ATOM news feeds.

The project has two main goals:
1. Providing an automated workflow for feed extraction and analysis
2. Providing a browser-based user interface for analysis and reporting

Beyond these main goals, the project also aims to provide a scalable framework into which other feed mechanisms (such as social media streaming) and machine learning analytics can be plugged.

The report can be downloaded from the location given below.




Map Reduce 1 – Insight

Dear fellow Hadoopers,

After a quick introduction to HDFS, my instructor started on Map/Reduce concepts today. I realized that he touched upon many concepts in a short session. Here is what I scribbled down.

M/R – As the name implies, it has two parts.

  1. Map
  2. Reduce



Map

  • These are Java programs written using the M/R algorithm
  • A mapping program runs on each block of the big data file
  • It transforms the data, for example by picking the URL out of each line of a web server access log file


Reduce

  • These are also Java programs written using the M/R algorithm
  • We do aggregations with reduce
  • The output of mapping becomes the input of reducing
  • We do many types of aggregation to arrive at the results required by the business logic
  • A good example is counting how many requests were received for a particular URL of a web server
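To make the access log example concrete, here is a minimal sketch in plain Python. It does not use the Hadoop API (real jobs would be Java programs, as noted above). The log lines, and the names `map_line`, `reduce_counts` and `count_requests`, are made up for illustration.

```python
from collections import defaultdict

# Hypothetical access-log lines (illustrative data only).
LOG_LINES = [
    '10.0.0.1 - - [01/Jan/2015:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2015:10:00:01 +0000] "GET /about.html HTTP/1.1" 200 128',
    '10.0.0.1 - - [01/Jan/2015:10:00:02 +0000] "GET /index.html HTTP/1.1" 304 0',
]


def map_line(line):
    """Map: pick the URL out of one log line and emit (url, 1)."""
    request = line.split('"')[1]   # e.g. 'GET /index.html HTTP/1.1'
    url = request.split()[1]       # the second token is the URL
    return (url, 1)


def reduce_counts(url, counts):
    """Reduce: aggregate all the 1s emitted for a single URL."""
    return (url, sum(counts))


def count_requests(lines):
    """Run map over every line, group by URL, then reduce each group."""
    grouped = defaultdict(list)
    for url, one in map(map_line, lines):   # the output of map ...
        grouped[url].append(one)            # ... becomes the input of reduce
    return dict(reduce_counts(u, c) for u, c in grouped.items())


print(count_requests(LOG_LINES))  # {'/index.html': 2, '/about.html': 1}
```

On a real cluster the map step would run in parallel on each block of the log file. Here everything runs in one process just to show the data flow.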

M/R Process flow

The following gives an example of the MR process flow.


M/R 1 process flow example

We have many words in a block of a big data file. This is our input.

During the split phase, Hadoop reads the input sequentially and produces (K1, V1) pairs. K1 is the key and V1 is the value; each value is one record of the input.

Mapping takes the output of the split phase and parses the content. It emits (K2, V2) key-value pairs, where K2 is the word and V2 is its number of appearances.

The process then moves to the shuffling phase, which takes the (K2, V2) pairs and groups each word together with its appearances.

The output of shuffling is given to the reduce process, where the aggregation is performed and the final result, a list of (K3, V3) pairs, is prepared.
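The four phases can be sketched in plain Python as below. This is a single-process toy and not Hadoop code. The function names and the sample block are my own. I use the line number as K1 for simplicity, whereas Hadoop's default text input uses the byte offset of the line.

```python
from itertools import groupby


def split_phase(block):
    """Split: read the block sequentially; K1 = line number, V1 = the record."""
    return list(enumerate(block.splitlines()))


def map_phase(records):
    """Map: parse each record and emit one (K2, V2) = (word, 1) pair per word."""
    return [(word, 1) for _k1, v1 in records for word in v1.split()]


def shuffle_phase(pairs):
    """Shuffle: sort by K2 and gather the values of each word into one list."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return [(k2, [v for _, v in group])
            for k2, group in groupby(ordered, key=lambda kv: kv[0])]


def reduce_phase(grouped):
    """Reduce: aggregate each word's list into (K3, V3) = (word, total)."""
    return [(k2, sum(values)) for k2, values in grouped]


block = "deer bear river\ncar car river\ndeer car bear"
result = reduce_phase(shuffle_phase(map_phase(split_phase(block))))
print(result)  # [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]
```

Each function consumes exactly what the previous phase produced, which is the chain K1/V1 to K2/V2 to K3/V3 described above.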


M/R 1 Task Execution

  1. We launch the job from the client to the Job Tracker
  2. The Job Tracker runs on a server-class machine
  3. The Job Tracker submits the job's tasks to the Task Trackers on the data nodes
  4. It can monitor the tasks on all nodes
  5. If required, it can kill the tasks

Hadoop process flow

We divide the Hadoop process flow into two major classifications:

Storage and Jobs

Storage – Building blocks

We have an HDFS cluster. In this cluster we have different data nodes: node 1, node 2, ... node n. All of these are slave nodes, on which the data is stored as small chunks. These nodes are administered by the admin node using the name node.

When a huge amount of data comes in, the admin node controls how many nodes are available, where the data can be stored, and so on. The name node holds the catalog of the data already saved on the slave nodes.
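A toy model of this catalog idea is sketched below in plain Python. The `NameNode` class, the round-robin placement and the 4-character block size are all assumptions of mine for illustration. Real HDFS uses much larger blocks, replicates them, and has its own placement policy. The `storage` dictionary merely stands in for the disks of the data nodes.

```python
class NameNode:
    """Toy name node: knows the data nodes and keeps a catalog of blocks."""

    def __init__(self, data_nodes, block_size):
        self.data_nodes = list(data_nodes)   # available slave nodes
        self.block_size = block_size         # characters per block (toy value)
        self.catalog = {}                    # file name -> [(block index, node)]
        # Stand-in for the disks of the data nodes.
        self.storage = {node: {} for node in self.data_nodes}

    def put(self, name, data):
        """Cut the data into blocks and spread them round-robin over the nodes."""
        placements = []
        for start in range(0, len(data), self.block_size):
            index = start // self.block_size
            node = self.data_nodes[index % len(self.data_nodes)]
            self.storage[node][(name, index)] = data[start:start + self.block_size]
            placements.append((index, node))
        self.catalog[name] = placements

    def get(self, name):
        """Use the catalog to fetch every block back in order."""
        return "".join(self.storage[node][(name, index)]
                       for index, node in self.catalog[name])


nn = NameNode(["node1", "node2", "node3"], block_size=4)
nn.put("feeds.txt", "abcdefghijklmnop")
print(nn.catalog["feeds.txt"])  # [(0, 'node1'), (1, 'node2'), (2, 'node3'), (3, 'node1')]
print(nn.get("feeds.txt"))      # abcdefghijklmnop
```

The point of the sketch is only that the data itself lives on the slave nodes, while the name node keeps the record of which block went where.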

Jobs – Building blocks

We start with the MapReduce engine. It provides a set of advanced APIs to retrieve the data. MapReduce jobs are scheduled by the Job Tracker, which is the admin of jobs. It relies on Task Trackers running on the individual data nodes to complete a job. Generally, a data node has both the data and a Task Tracker.

The Job Tracker tracks job completion. If a task fails on a node, it finds an alternate node to get the task done.
I have depicted it pictorially below.

Hadoop storage and jobs - Javashine
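The retry behaviour of the Job Tracker can be imitated with a few lines of plain Python. This is only a sketch of the idea, not how Hadoop implements it. The `TaskTracker` and `JobTracker` classes, the `healthy` flag and the task names are hypothetical.

```python
class TaskTracker:
    """Toy task tracker living on one data node."""

    def __init__(self, node, healthy=True):
        self.node = node
        self.healthy = healthy   # hypothetical flag used to simulate a failure

    def run(self, task):
        if not self.healthy:
            raise RuntimeError(f"{self.node} failed while running {task}")
        return f"{task} done on {self.node}"


class JobTracker:
    """Toy job tracker: hands tasks to task trackers and retries on failure."""

    def __init__(self, trackers):
        self.trackers = trackers

    def run_task(self, task):
        for tracker in self.trackers:      # try the nodes in order
            try:
                return tracker.run(task)
            except RuntimeError:
                continue                   # find an alternate node
        raise RuntimeError(f"no node could complete {task}")

    def run_job(self, tasks):
        """A job is complete when every one of its tasks has finished."""
        return [self.run_task(task) for task in tasks]


jt = JobTracker([TaskTracker("node1", healthy=False), TaskTracker("node2")])
print(jt.run_job(["map-0", "map-1"]))  # ['map-0 done on node2', 'map-1 done on node2']
```

Here node1 always fails, so the Job Tracker falls back to node2 for both tasks.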



HDFS & MapReduce – Set of concepts

After writing about the ecosystem of Hadoop, I should write about wiring those blocks together to see them working. Before doing this, I prefer to document the HDFS/MR paradigm quickly.

If we look at Hadoop at a high level, we can separate it into 2 parts.


1. HDFS
2. Map/Reduce

Nodes in a Hadoop cluster store their data in HDFS. It stores a huge volume of data as many small blocks. HDFS runs on top of the Unix file system (or whatever file system the host provides).

Searching for the data across multiple nodes based on the catalog, and aggregating it to arrive at the desired results, is called MapReduce processing.

I have depicted it diagrammatically below.

HDFS, MapReduce paradigm -  Javashine



Hadoop EcoSystem

Here comes another important piece of theory after the 5 Vs: yet another interesting concept of the Big Data paradigm.

Inserting your data:

Sqoop/Flume – These tools are responsible for loading data into the file system from various sources.


HDFS – The Hadoop Distributed File System, which stores huge volumes of data as small blocks across multiple nodes or servers.

HBase – This complements HDFS where HDFS has limitations. It offers streaming or real-time updates.

Map Reduce / YARN – This is the set of APIs to collate the data and process it to arrive at the desired result.

HCatalog – This is the ‘Directory’ service for HDFS. It helps to access the data on the data nodes and to standardize data access.

Hive/Pig – Analytics tools with scripting.


Oozie – This is used to create workflows.

Ambari – This is used to wire the different components of the Hadoop ecosystem together to form a coherent operation.

Let’s talk about each one of them in detail later, if possible!