Big data processing pipeline and tools

I had an overdose of theoretical sessions today. I’ll try to summarize what I’ve captured so far.

This blog post focuses on the stages of big data processing and examples of the tools used in each phase.

Ingest

We should be able to acquire data from different data sources.

Logs: log files, activity logs, service logs, etc.

Database: standard RDBMS data

Content inventory: articles, goods, books, blogs

We can look into acquisition tools for this phase.

We have Sqoop to work with databases, WebHDFS for web-based APIs, and WebDAV for drag-and-drop of data. For streaming data, we can use Flume, Scribe, or Kafka.
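
To make the streaming side concrete, here is a minimal sketch that pushes activity events into Kafka using the kafka-python client. The broker address, topic name, and event fields are assumptions for illustration.

    # Minimal sketch: push activity-log events into Kafka with kafka-python.
    # Broker address, topic name, and event fields are assumed.
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers='localhost:9092',  # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    )

    event = {'user': 'u123', 'action': 'page_view', 'ts': 1700000000}
    producer.send('activity-logs', event)    # assumed topic name
    producer.flush()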

Augment

After ingesting the data, we need to apply business rules and prepare metadata before putting it into HDFS; a small sketch of this step follows the tool list below.

Preparation tools applicable for this phase are given below.

Data governance: Falcon

Metadata management: HCatalog
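
Independent of Falcon/HCatalog, the augmentation step itself can be pictured as a small function that stamps each record with provenance metadata and applies a business rule before the record goes to HDFS. This is a hypothetical sketch; the field names and the rule are made up.

    # Hypothetical sketch: tag records with metadata and apply a simple
    # business rule before writing to HDFS. Fields and rule are made up.
    import time

    def augment(record, source):
        record['_source'] = source                 # provenance
        record['_ingested_at'] = int(time.time())  # ingest timestamp
        record['_schema_version'] = 1              # schema tag for metadata lookup
        return record

    def passes_rules(record):
        # Example rule: drop records that have no user id.
        return bool(record.get('user'))

    raw = [{'user': 'u123', 'action': 'page_view'}, {'action': 'orphan'}]
    prepared = [augment(r, 'weblogs') for r in raw if passes_rules(r)]
    print(prepared)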

Analyze

Based on the business rules, computations are run over the data.

Tools for processing the data are given below.

Batch: MapReduce/Hive/Pig, Hadoop Streaming (see the mapper/reducer sketch after this list)

Workflow engines: Oozie, Cascading, Azkaban

Graph tools: Giraph, Neo4j, FlockDB, InfoGrid
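
For the batch path, the classic example is word count via Hadoop Streaming, where the mapper and reducer are plain Python scripts reading stdin and writing stdout. The scripts below are the standard sketch; the file names are assumptions.

    # mapper.py - Hadoop Streaming mapper: emit (word, 1) for every word.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print('%s\t1' % word)

    # reducer.py - Hadoop Streaming reducer: sum counts per word.
    # Input arrives sorted by key, so equal words are adjacent.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip('\n').split('\t')
        if word != current:
            if current is not None:
                print('%s\t%d' % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print('%s\t%d' % (current, count))

These two files would be submitted with the hadoop-streaming jar (-mapper mapper.py -reducer reducer.py plus -input/-output paths on HDFS).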

Store

We use HDFS to store the data.
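
From Python, one way to land prepared data in HDFS is over the WebHDFS gateway mentioned earlier, for example with the hdfs client library. A minimal sketch; the NameNode address, user, and path are assumptions.

    # Sketch: write a small file into HDFS over WebHDFS using the
    # `hdfs` client library. Host, user, and path are assumed.
    from hdfs import InsecureClient

    client = InsecureClient('http://namenode:50070', user='hadoop')
    client.write('/data/prepared/events.json',
                 data=b'{"user": "u123", "action": "page_view"}\n',
                 overwrite=True)
    print(client.status('/data/prepared/events.json'))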

Privacy

Please make sure we don’t store sensitive data like passwords, credit card numbers, bank details, etc.
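
One simple way to enforce this is to drop or hash sensitive fields before anything reaches storage. A hedged sketch; the field lists here are hypothetical.

    # Sketch: strip or hash sensitive fields before storage.
    # The field lists are hypothetical examples.
    import hashlib

    DROP_FIELDS = {'password', 'credit_card', 'bank_account'}
    HASH_FIELDS = {'email'}  # keep joinable, but not readable

    def sanitize(record):
        clean = {}
        for key, value in record.items():
            if key in DROP_FIELDS:
                continue  # never store these
            if key in HASH_FIELDS:
                value = hashlib.sha256(str(value).encode()).hexdigest()
            clean[key] = value
        return clean

    print(sanitize({'user': 'u123', 'password': 'hunter2', 'email': 'a@b.com'}))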

Scoring

While storing the data, we need to update the indexes. This makes it easy to find things like the most frequently occurring instances.
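
The “most occurring instances” part can be pictured as a frequency index maintained while records are stored. A minimal sketch with collections.Counter; the 'action' field is an assumption.

    # Sketch: maintain a frequency index while records are stored,
    # then read off the most common instances.
    from collections import Counter

    index = Counter()

    def on_store(record):
        index[record['action']] += 1  # 'action' is an assumed field

    for r in [{'action': 'page_view'}, {'action': 'click'},
              {'action': 'page_view'}]:
        on_store(r)

    print(index.most_common(2))  # [('page_view', 2), ('click', 1)]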

Render

Render the data using tools that work off the catalogs/indexes.

Logs

The store and analytics stages will log their own activity. We can still go ahead and process those logs.

Analyze

The logs can still be analyzed and stored; they may follow the same path as given above.

The processed data would be stored in a different folder so that it can be retrieved later. This data can be the input to another batch job.
Some of the tools worth mentioning here –

  • Presentation – SQL-like querying on Hadoop – Tez, Shark, Drill, Presto, HAWQ
  • Point query – Cassandra, MongoDB (see the pymongo sketch after this list)
  • MPP (massively parallel processing) – Impala, Stado
  • Archival – DAS/HDFS
  • Security – Sentry, Knox, Hadoop security (Kerberos)
  • Management – Cloudera Manager, Ambari, Puppet, Chef
  • Monitoring – Ganglia, Nagios, Munin, Sensu
  • Reporting – Pentaho, Jasper, BIRT
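
As a taste of the point-query style from the list above, here is a hedged pymongo sketch; the connection string, database, collection, and key are assumptions.

    # Sketch: a point query against MongoDB with pymongo.
    # Connection string, database, collection, and key are assumed.
    from pymongo import MongoClient

    client = MongoClient('mongodb://localhost:27017')
    events = client['analytics']['events']

    # Fetch a single document by key - the "point query" access pattern.
    doc = events.find_one({'user': 'u123'})
    print(doc)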

Let’s see how it goes.
