Big data processing pipeline and tools

I had an overdose of theoretical sessions today. I’ll try to summarize what my mind has captured so far.

This blog post focuses on the stages of big data processing and examples of the tools used in each phase.


We should be able to acquire data from a variety of sources:

Logs: log files, activity logs, service logs, etc.

Database: Standard RDBMS data

Content inventory: articles, goods, books, blogs

Let’s look at the acquisition tools that apply to this phase.

We have Sqoop to work with databases, WebHDFS for web-based APIs, and WebDAV for drag-and-drop of data. For streaming data, we can use Flume, Scribe, or Kafka.
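To make the ingestion step concrete, here is a minimal Python sketch (not from the original post) that pushes a local log file into HDFS over the WebHDFS REST API; the NameNode address, user name, and file paths are placeholders for whatever your cluster uses.

import requests

NAMENODE = "http://namenode.example.com:50070"   # assumed NameNode HTTP address
HDFS_PATH = "/raw/logs/activity.log"             # hypothetical target path in HDFS

with open("activity.log", "rb") as f:
    # Step 1: ask the NameNode where to write; it answers with a 307 redirect
    # that points at a DataNode.
    resp = requests.put(
        NAMENODE + "/webhdfs/v1" + HDFS_PATH,
        params={"op": "CREATE", "overwrite": "true", "user.name": "hdfs"},
        allow_redirects=False,
    )
    datanode_url = resp.headers["Location"]

    # Step 2: stream the file contents to the DataNode location it returned.
    requests.put(datanode_url, data=f)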


After ingesting the data, we need to apply business rules and prepare metadata before putting it into HDFS.

Preparation tools applicable for this phase are given below.

Data governance: Falcon

Metadata management: HCatalog
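As a hedged illustration of what registering metadata can look like, the sketch below assumes a HiveServer2 endpoint and the PyHive client, and registers a raw log directory as an external Hive table; HCatalog then exposes that same metastore entry to Pig and MapReduce jobs. The host, table, and column names are assumptions.

from pyhive import hive

conn = hive.connect(host="hive.example.com", port=10000, username="etl")  # assumed endpoint
cur = conn.cursor()
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_activity_logs (
        ts STRING,
        user_id STRING,
        action STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/raw/logs'
""")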


Based on the business rules, the data is then processed.

Tools for processing the data are given below.

Batch: MapReduce/Hive/Pig, Streaming (see the word-count sketch after this list)

Workflow engines: Oozie, Cascading, Azkaban

Graph tools: Giraph, Neo4j, FlockDB, InfoGrid
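For the batch row above, a minimal Hadoop Streaming word count in plain Python might look like the sketch below. Both roles are kept in one hypothetical script (wordcount.py) for brevity; the streaming jar path and HDFS directories are placeholders.

# Submit with something like:
#   hadoop jar hadoop-streaming.jar \
#       -input /raw/logs -output /processed/wordcount \
#       -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" \
#       -file wordcount.py
import sys

def mapper():
    # Emit (word, 1) for every token read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word)

def reducer():
    # Input arrives sorted by key, so the counts for one word are contiguous.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()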


We use HDFS to store the data.


Please make sure we don’t store sensitive data like passwords, credit card numbers, bank details, etc.
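As a small illustration (the field names and the card-number pattern are assumptions, not a complete solution), records could be scrubbed before they are written to HDFS:

import re

SENSITIVE_FIELDS = {"password", "credit_card", "bank_account"}  # hypothetical field names
CARD_RE = re.compile(r"\b\d{13,16}\b")                          # naive card-number pattern

def scrub(record):
    # Drop sensitive fields outright and mask anything that looks like a card number.
    clean = {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}
    for key, value in clean.items():
        if isinstance(value, str):
            clean[key] = CARD_RE.sub("[REDACTED]", value)
    return clean

print(scrub({"user": "alice", "password": "s3cret", "note": "paid with 4111111111111111"}))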


While storing the data, we need to update the indexes. This makes it easy to find, for example, the most frequently occurring instances.
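For illustration, a quick frequency index over one day’s extracted records could be built with collections.Counter; the file path and column layout below are assumptions.

from collections import Counter

counts = Counter()
with open("processed/actions.tsv") as f:             # hypothetical local extract
    for line in f:
        action = line.rstrip("\n").split("\t")[2]     # assume the action is the third column
        counts[action] += 1

print(counts.most_common(10))                         # the ten most frequently occurring actions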


Render the data with presentation tools that use the catalogs/indexes.


The storage and analytics layers will log their own activity. We can go ahead and process those logs as well.


These logs can in turn be analyzed and stored; they may follow the same path as described above.

The processed data is stored in a separate folder so that it can be retrieved later; this data can be the input to another batch job.
Some of the tools worth mentioning here –

  • Presentation – SQL-like querying on Hadoop – Tez, Shark, Drill, Presto, HAWQ
  • Point query – Cassandra, MongoDB (see the pymongo sketch after this list)
  • MPP (massively parallel processing) – Impala, Stado
  • Archival – DAS/HDFS
  • Security – Sentry, Knox, Hadoop security (Kerberos)
  • Management – Cloudera Manager, Ambari, Puppet, Chef
  • Monitoring – Ganglia, Nagios, Munin, Sensu
  • Reporting – Pentaho, Jasper, BIRT
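As an example of the point-query style, here is a small pymongo sketch; the connection string, database, collection, and field names are placeholders.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed connection string
profile = client["analytics"]["user_profiles"].find_one({"user_id": "alice"})
print(profile)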

Let’s see how it goes.

