
Saturday, 6 September 2014

What is Hadoop...Apache Hadoop Terms/Abbreviations.

What is Hadoop

Need for Hadoop:


Big Data:

Today we are surrounded by data; in fact, it would not be wrong to say that we live in the data age. The amount of data is growing exponentially, and as it grows it becomes more and more challenging for organisations to maintain and analyse it. The success of an organisation largely depends on its ability to extract valuable information from this huge amount of data. Hadoop deals with this exploding data by scaling out rather than scaling up, i.e. by using more computers rather than bigger computers.

Data Storage:

The access speeds of hard drives have not increased in proportion to their storage capacities over the years. As a result, it takes hours to read an entire hard disk, and even longer to write to it. This problem can be solved by dividing the data across multiple hard drives and reading from them in parallel.
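To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. The figures (a 1 TB drive read at 100 MB/s) and the helper name `read_time_seconds` are illustrative assumptions, not measurements, and the model ignores seek time and coordination overhead.

```python
def read_time_seconds(total_bytes, bytes_per_second, drives=1):
    """Idealised time to read total_bytes split evenly across `drives` disks in parallel."""
    if drives < 1:
        raise ValueError("drives must be at least 1")
    return total_bytes / (bytes_per_second * drives)

ONE_TB = 10**12            # assumed drive capacity in bytes
THROUGHPUT = 100 * 10**6   # assumed sustained transfer rate: 100 MB/s

single = read_time_seconds(ONE_TB, THROUGHPUT)
parallel = read_time_seconds(ONE_TB, THROUGHPUT, drives=100)

print(f"1 drive:    {single / 3600:.1f} hours")
print(f"100 drives: {parallel / 60:.1f} minutes")
```

Under these assumptions a single drive needs close to three hours, while 100 drives sharing the same data need under two minutes.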

Parallel read and write operations raise new issues, such as:
  1. Need to handle hardware failures: Hadoop has its own distributed filesystem, HDFS, which deals with hardware failures by replicating data. We'll learn more about HDFS in upcoming posts.
  2. Ability to combine data from different drives: Most analyses will require data from different hard drives. Hadoop uses the MapReduce programming model, which abstracts this problem by transforming it into a computation over key-value pairs. We'll learn this programming model in upcoming posts. For now, all you need to know is that there are two phases of computation, Mapping and Reducing. Mixing occurs at the interface between these two phases.
Thus, in short, Hadoop provides two components, HDFS and MapReduce, which together provide a reliable shared storage and analysis system.
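The two phases can be sketched in a few lines of plain Python. This is a single-process toy that counts words, not the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key; this is the mixing step between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce over the input lines and return word totals."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Each string stands in for data stored on a different drive.
counts = word_count(["big data needs hadoop", "hadoop stores big data"])
print(counts)
```

The mapper only ever sees one line and the reducer only ever sees one key, which is what lets a real cluster run many of them side by side on different machines.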

Hadoop Introduction:


Hadoop is a framework for implementing distributed computing to process big data. Some of the key features of Hadoop are:

  1. Accessibility: Hadoop runs on large clusters of commodity hardware.
  2. Robustness: Hadoop handles failures by replication of data.
  3. Scalability: Hadoop scales linearly to handle more data by adding more nodes to the cluster.
  4. Simplicity: Hadoop allows users to write parallel programs quickly.
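The robustness point can be illustrated with a small simulation. This is only a toy: the replication factor of 3 matches the HDFS default, but the round-robin placement and the names `place_replicas` and `readable_blocks` are assumptions made for this sketch, not how HDFS actually chooses nodes.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }

def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]

nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(num_blocks=10, nodes=nodes)

# With three copies of every block, losing any two nodes leaves all data readable.
survivors = readable_blocks(placement, failed={"node2", "node4"})
print(len(survivors), "of", len(placement), "blocks still readable")
```

Data is only lost when every node holding a given block fails at once, which is why replication across cheap commodity machines can still give a reliable store.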

          The image below shows how users interact with a Hadoop cluster.

          [Figure: Client interaction with Hadoop cluster]

          In my next post I'll show how Hadoop compares with other systems.

          Apache Hadoop Terms/Abbreviations:

          HDFS - Hadoop Distributed File System
          GFS - Google File System
          JSON - JavaScript Object Notation
          NN - NameNode
          DN - DataNode
          SNN - Secondary NameNode
          JT - JobTracker
          TT - TaskTracker
          HA NN - Highly Available NameNode (or NN HA - NameNode High Availability)
          REST - Representational State Transfer
          HiveQL - Hive Query Language
          CDH - Cloudera’s Distribution Including Apache Hadoop
          ZKFC - ZooKeeper Failover Controller
          FUSE - Filesystem in Userspace
          YARN - Yet Another Resource Negotiator
          Amazon EC2 - Amazon Elastic Compute Cloud
          Amazon S3 - Amazon Simple Storage Service
