In recent years, it has become widely recognized that data is one of the major drivers of business. Data and the analysis built on it help business groups measure their depth, track the pace of the market, perform gap analysis, and understand consumer feedback. Data has also consistently challenged technology. Data management paradigms have evolved from flat files to RDBMS, and then from RDBMS to need-based databases such as NoSQL and in-program file structures. Even so, analyzing massive volumes of data remains a real challenge.
In this article, we look at one of the more recent technologies in data management and analysis: Hadoop. Hadoop and its member components provide the infrastructure to manage and analyze large volumes of data.
Hadoop is an open source framework for managing and processing huge volumes of data. It can be deployed in data centers or on commodity machines to run large clusters of nodes, provided the hardware is robust and adequately sized. The way Hadoop processes data is similar to a NoSQL key-value store: each unit of data is handled as a key-value pair, while the relationships between keys are defined in the application tier.
The salient features of the Hadoop framework include:
- Open source, shared nothing computing model
- Parallel batch data processing on large clusters
- Intelligent handling of node failures, with work redeployed as required
The high-level Hadoop architecture includes the following key components:
Hadoop client – The client runs on a workstation and acts as the launchpad for data activities. It kicks off the processing and submits the job to the Hadoop cluster.
NameNode – Every Hadoop cluster has one NameNode. It is the manager of the HDFS file system and controls the slave DataNode daemons. It is the most important node of the Hadoop cluster: it manages the metadata and keeps track of which blocks are stored on which nodes. Note that in this architecture it is also a single point of failure for the cluster.
DataNode – Each slave node in the Hadoop cluster hosts a DataNode, which is responsible for end-to-end management of the data held on that physical node as part of HDFS.
JobTracker – It mediates between the application and the Hadoop cluster. Every Hadoop cluster has its own JobTracker, which prepares the execution plan for the code submitted by the end user. It tracks the location of data across the nodes, prompts the nodes to respond, and manages the execution of tasks.
TaskTracker – A TaskTracker runs on each slave node and communicates with the JobTracker to execute a task or piece of code. Apart from job execution, the TaskTracker and JobTracker also stay in contact to check node health.
Hadoop Distributed File System (HDFS) – A distributed file system that can hold large chunks of data across the nodes in a cluster. The file system can be accessed either through a user interface application or through a command-line interface.
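To make the NameNode's bookkeeping role more concrete, here is a minimal toy sketch in Python. It is not Hadoop code: the `ToyNameNode` class, its method names, the tiny block size, and the round-robin placement are all assumptions made for illustration. It only shows the idea of keeping a file-to-block and block-to-node map, and of restoring lost copies when a node fails.

```python
from collections import defaultdict

BLOCK_SIZE = 4   # bytes per block; tiny for illustration (real HDFS blocks are many megabytes)
REPLICATION = 2  # copies kept of each block (an assumed value for this sketch)


class ToyNameNode:
    """Hypothetical model: holds metadata only, never the file data itself."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)       # names of live DataNodes
        self.file_blocks = defaultdict(list)   # file path -> ordered block ids
        self.block_locations = {}              # block id -> set of DataNode names
        self._next_node = 0                    # round-robin placement cursor

    def _pick_nodes(self, count, exclude=()):
        """Choose `count` distinct live nodes, skipping any in `exclude`."""
        candidates = [n for n in self.datanodes if n not in exclude]
        if len(candidates) < count:
            raise RuntimeError("not enough live DataNodes")
        start = self._next_node % len(candidates)
        self._next_node += 1
        rotated = candidates[start:] + candidates[:start]
        return rotated[:count]

    def add_file(self, path, data):
        """Split data into blocks and record a placement for each block."""
        for offset in range(0, len(data), BLOCK_SIZE):
            block_id = f"{path}#blk{offset // BLOCK_SIZE}"
            self.file_blocks[path].append(block_id)
            self.block_locations[block_id] = set(self._pick_nodes(REPLICATION))

    def handle_node_failure(self, dead):
        """Drop a dead node and re-replicate any block that lost a copy."""
        self.datanodes.remove(dead)
        for nodes in self.block_locations.values():
            if dead in nodes:
                nodes.discard(dead)
                nodes.update(self._pick_nodes(1, exclude=nodes))


if __name__ == "__main__":
    namenode = ToyNameNode(["dn1", "dn2", "dn3"])
    namenode.add_file("/logs/a.txt", b"0123456789")
    namenode.handle_node_failure("dn1")
    for block_id, nodes in namenode.block_locations.items():
        print(block_id, sorted(nodes))
```

The sketch keeps the data itself out of the picture on purpose: the point is that the NameNode only needs to know where blocks live, which is also why losing it is so costly for the cluster.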
MapReduce is a distributed computing model introduced by Google to operate on large data sets. It handles parallel processing of large volumes of data distributed across the nodes of a cluster. Hadoop supports the MapReduce computing model. MapReduce programs are typically written in Java, and other languages such as Python or Perl can be used as well, for example through Hadoop Streaming.
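As a rough illustration of what a MapReduce program in Python can look like, the sketch below follows the Hadoop Streaming convention of exchanging tab-separated `key<TAB>value` text records. The word-count task and the function names `mapper`, `reducer`, and `run_locally` are assumptions chosen for this example. In a real streaming job the mapper and reducer would be separate scripts reading standard input, and the framework would do the sorting that `run_locally` imitates here.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum counts per word; input must already be sorted by key."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_locally(["big data big", "data node"]):
        print(record)
```

Running the pipeline locally like this is a common way to check the logic on a small sample before submitting it to a cluster.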
As the name suggests, the computing model has two components: the mapper and the reducer. A Hadoop cluster can run multiple mappers and reducers in parallel. Each mapper processes one portion of the input data and emits intermediate key-value pairs. A partitioner decides which reducer receives each key. The reducers then work on the results of the map processes and return the final output to the MapReduce program.
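The flow from mappers through the partitioner to the reducers can be sketched in a single Python process. This is a simulation under stated assumptions, not Hadoop's implementation: the function names are hypothetical, the job is a simple word count, and `zlib.crc32` stands in for the key hash so that results are stable between runs.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer for a key; every occurrence of a key maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_phase(splits):
    """Process each input split in turn, emitting (word, 1) pairs."""
    for split in splits:
        for word in split.split():
            yield word, 1


def shuffle(pairs, num_reducers):
    """Route every pair to its reducer's bucket and group values by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets


def reduce_phase(buckets):
    """Each reducer sums the values for the keys it owns."""
    return [{key: sum(values) for key, values in bucket.items()} for bucket in buckets]


if __name__ == "__main__":
    splits = ["a b a", "b c"]  # two input splits, one per mapper
    for index, result in enumerate(reduce_phase(shuffle(map_phase(splits), 2))):
        print(f"reducer {index}: {result}")
```

Because the partition function depends only on the key, no two reducers ever see the same key, which is what lets them run in parallel without coordinating with each other.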
Hadoop is a solution for dealing with very large volumes of data. Its actual benefits, such as robustness, scalability and availability, are realized only when the data flow is massive. Typical candidates for Hadoop and similar implementations are companies such as Twitter, Google, Facebook or Amazon, which generate huge amounts of data through tweets, social responses, blogs, and emails. In comparatively small data centers, Hadoop can be underutilized and costly.
You can find all the upcoming programs for Data Analysis with Hadoop at http://www.bookmytrainings.com/data-analysis-a-hadoop/hadoop