I had an overdose of theoretical sessions today. I’ll try to summarize what I have absorbed so far.
This blog post focuses on the stages of Big Data processing and gives examples of the tools used in each phase.
We should be able to acquire data from different data sources.
Logs: Log files, activity logs, service logs, etc.
Database: Standard RDBMS data
Content Inventory: Articles, Goods, Books, blogs
We can look at acquisition tools for this phase.
We have Sqoop to work with databases, Flume for streaming data, WebHDFS for web-based APIs, and WebDAV for drag-and-drop of data. Besides Flume, Scribe and Kafka are further options for inserting streaming data.
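As a rough illustration of the WebHDFS route, the sketch below only builds the REST URL for a file upload; it never contacts a cluster. The host name, port, path, and user are placeholders I made up, and `webhdfs_create_url` is a hypothetical helper of mine, not part of any library.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(host, port, hdfs_path, user=None, overwrite=False):
    """Build the WebHDFS URL for an op=CREATE request (sent with HTTP PUT)."""
    # WebHDFS exposes the file system under the /webhdfs/v1 prefix.
    params = {"op": "CREATE", "overwrite": str(overwrite).lower()}
    if user is not None:
        # user.name is how WebHDFS identifies the caller when security is off.
        params["user.name"] = user
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        host, port, quote(hdfs_path), urlencode(params)
    )


# Placeholder host, port, path, and user, chosen only for this example.
url = webhdfs_create_url("namenode.example.com", 9870, "/data/raw/app.log", user="etl")
print(url)
```

The actual upload would be an HTTP PUT against this URL, followed by a second PUT to the DataNode address that the NameNode returns as a redirect.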
After ingesting the data, we need to apply business rules and prepare metadata before putting it into HDFS.
Preparation tools applicable for this phase are given below.
Data governance: Falcon
Metadata Management: HCatalog
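To make the preparation step a little more concrete, here is a minimal sketch in plain Python. It is not Falcon or HCatalog code. The business rule (drop records without a `user_id`), the field names, and the `prepare` function are all assumptions invented for illustration.

```python
from datetime import datetime, timezone


def prepare(records, source):
    """Apply a sample business rule and describe the surviving records."""
    # Hypothetical rule: a record is only usable if it carries a user_id.
    clean = [r for r in records if r.get("user_id")]
    metadata = {
        "source": source,
        "record_count": len(clean),
        "rejected_count": len(records) - len(clean),
        # Column names seen across the clean records, sorted for stable output.
        "columns": sorted({key for r in clean for key in r}),
        "prepared_at": datetime.now(timezone.utc).isoformat(),
    }
    return clean, metadata


raw = [
    {"user_id": "u1", "action": "login"},
    {"user_id": "", "action": "click"},
    {"user_id": "u2", "action": "logout", "device": "mobile"},
]
clean, metadata = prepare(raw, source="activity-logs")
print(metadata["record_count"], metadata["rejected_count"], metadata["columns"])
```

The metadata dictionary is the kind of description that would travel with the data into HDFS, so later stages know what they are reading.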
Based on business rules, the data is computed.
Tools for processing the data are given below.
Batch: MapReduce/Hive/Pig, Streaming
Workflow engines: Oozie, Cascading, Azkaban
Graph tools: Giraph, Neo4j, FlockDB, InfoGrid
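The batch model behind MapReduce can be sketched in a few lines of plain Python. This runs in a single process and only imitates the map, shuffle, and reduce steps. It is not Hadoop code, and the sample log lines are invented.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (key, 1) pair for the log level at the start of each line."""
    level = line.split(" ", 1)[0]
    yield level, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


lines = [
    "ERROR disk full",
    "INFO job started",
    "ERROR timeout",
    "INFO job done",
    "WARN slow task",
]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'ERROR': 2, 'INFO': 2, 'WARN': 1}
```

On a real cluster the map and reduce functions run on many machines in parallel, and the framework takes care of the shuffle.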
We use HDFS to store the data.
Please make sure we don’t store sensitive data such as passwords, credit card numbers, or bank details.
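One way to act on this is to scrub records before they are written. The sketch below is a simplified assumption of what that could look like: the field names in `SENSITIVE_FIELDS` are hypothetical, and the card-number pattern is a rough heuristic (13 to 16 digits, optionally separated by spaces or hyphens), not a complete detector.

```python
import re

# Hypothetical field names that should never reach storage.
SENSITIVE_FIELDS = {"password", "card_number", "bank_account"}

# Rough heuristic: 13 to 16 digits, each optionally followed by a space or hyphen.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def scrub(record):
    """Return a copy of the record that is safer to store."""
    clean = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            continue  # drop the whole field
        if isinstance(value, str):
            # Mask anything in free text that looks like a card number.
            value = CARD_PATTERN.sub("[REDACTED]", value)
        clean[key] = value
    return clean


record = {
    "user": "alice",
    "password": "hunter2",
    "note": "paid with 4111 1111 1111 1111 yesterday",
}
print(scrub(record))
```

Dropping known fields catches the obvious cases, and the pattern is a second line of defence for numbers that end up in free-text fields.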
While storing the data, we need to update the indexes. They are useful for finding things like the most frequently occurring instances.
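As a toy illustration of why an index helps, the sketch below keeps a term counter and an inverted index up to date as documents are stored. The `store` function and the in-memory dictionaries are stand-ins invented for this example; a real deployment would rely on a dedicated indexing tool.

```python
from collections import Counter, defaultdict

term_counts = Counter()             # term -> number of occurrences
inverted_index = defaultdict(set)   # term -> ids of documents containing it


def store(doc_id, text):
    """Pretend to store a document and update both indexes."""
    terms = text.lower().split()
    term_counts.update(terms)
    for term in terms:
        inverted_index[term].add(doc_id)


store("d1", "hadoop stores big data")
store("d2", "hive queries data on hadoop")
store("d3", "pig scripts process data")

print(term_counts.most_common(2))        # [('data', 3), ('hadoop', 2)]
print(sorted(inverted_index["hadoop"]))  # ['d1', 'd2']
```

Because the counts are maintained at write time, the most frequent terms can be read off without scanning the stored data again.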
Render the data with tools that make use of the catalogs/indexes.
The storage and analytics stages will log their own activity. Those logs can in turn be analyzed and stored, following the same path as described above.
The processed data is stored in a separate folder so that it can be retrieved later. This data can be an input to another batch job.
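The chaining idea can be sketched on a local file system. Everything here is an assumption for illustration: the folder name, the `part-*.json` file naming, and the two job functions are made up, and a temporary directory stands in for HDFS.

```python
import json
import tempfile
from pathlib import Path


def first_job(out_dir):
    """Write processed results into their own output folder."""
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = {"ERROR": 2, "INFO": 5}  # stand-in for real processing output
    (out_dir / "part-00000.json").write_text(json.dumps(counts))


def second_job(in_dir):
    """Use the first job's output folder as input and total the counts."""
    total = 0
    for part in sorted(in_dir.glob("part-*.json")):
        total += sum(json.loads(part.read_text()).values())
    return total


with tempfile.TemporaryDirectory() as root:
    processed = Path(root) / "processed" / "log-levels"
    first_job(processed)
    total = second_job(processed)

print(total)  # 7
```

Keeping the processed output in its own folder means the second job never has to touch, or even know about, the raw input.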
Some of the tools worth mentioning here –
- Presentation (SQL-like querying in Hadoop) – Tez, Shark, Drill, Presto, HAWQ
- Point Query – Cassandra, MongoDB
- MPP (Massively Parallel Processing) – Impala, Stado
- Archival – DAS/HDFS
- Security – Sentry, Knox, Hadoop Security (Kerberos)
- Mgmt – Cloudera Manager, Ambari, Puppet, Chef
- Monitoring – Ganglia, Nagios, Munin, Sensu
- Reporting – Pentaho, Jasper, BIRT.
Let's see how it goes.