Total Pageviews

Monday, 22 September 2014

Hello World! – An introduction to Hadoop with Hive and Pig

Overview of Apache Hadoop and Hortonworks Data Platform

The Hortonworks Sandbox is a single node implementation of the Hortonworks Data Platform(HDP). It is packaged as a virtual machine to make evaluation and experimentation with HDP fast and easy. The tutorials and features in the Sandbox are oriented towards exploring how HDP can help you solve your business big data problems. The Sandbox tutorials will walk you through bringing some sample data into HDP and manipulate it using the tools built into HDP. The idea is to show you how you can get started and show you how to accomplish tasks in HDP. HDP is FREE to download and use in your enterprise and you can download it here: Hortonworks Data Platform
Download
Yahoo Data<br />
                Nodes
The Apache Hadoop projects provide a series of tools designed to solve big data problems. The Hadoop cluster implements a parallel computing cluster using inexpensive commodity hardware. The cluster is partitioned across many servers to provide a near linear scalability. The philosophy of the cluster design is to bring the computing to the data. So each datanode will hold part of the overall data and be able to process the data that it holds. The overall framework for the processing software is called MapReduce. Here's a short video introduction to MapReduce: Introduction to MapReduce
Yahoo Map<br />
                Reduce
Apache Hadoop can be useful across a range of use cases spanning virtually every vertical industry. It is becoming popular anywhere that you need to store, process, and analyze large volumes of data. Examples include digital marketing automation, fraud detection and prevention, social network and relationship analysis, predictive modeling for new drugs, retail in-store behavior analysis, and mobile device location-based marketing.

The Hadoop Distributed File System

In this section we are going to take a closer look at some of the components we will be using in the Sandbox tutorials. Underlying all of these components is the Hadoop Distributed File System(HDFS™). This is the foundation of the Hadoop cluster. The HDFS file system manages how the datasets are stored in the Hadoop cluster. It is responsible for distributing the data across the datanodes, managing replication for redundancy and administrative tasks like adding, removing and recovery of datanodes.

Apache Hive™

Hive UI<br />
                Overview
The Apache Hive project provides a data warehouse view of the data in HDFS. Using a SQL-like language Hive lets you create summarizations of your data, perform ad-hoc queries, and analysis of large datasets in the Hadoop cluster. The overall approach with Hive is to project a table structure on the dataset and then manipulate it with HiveQL. Since you are using data in HDFS your operations can be scaled across all the datanodes and you can manipulate huge datasets.

Apache HCatalog

HCat UI<br />
                Overview
The function of HCatalog is to hold location and metadata about the data in a Hadoop cluster. This allows scripts and MapReduce jobs to be decoupled from data location and metadata like the schema. Additionally since HCatalog supports many tools, like Hive and Pig, the location and metadata can be shared between tools. Using the open APIs of HCatalog other tools like Teradata Aster can also use the location and metadata in HCatalog. In the tutorials we will see how we can now reference data by name and we can inherit the location and metadata.

Apache Pig™

Pig UI<br />
                Overview
Pig is a language for expressing data analysis and infrastructure processes. Pig is translated into a series of MapReduce jobs that are run by the Hadoop cluster. Pig is extensible through user-defined functions that can be written in Java and other languages. Pig scripts provide a high level language to create the MapReduce jobs needed to process data in a Hadoop cluster.
That‘s all for now… let‘s get started with some examples of using these tools together to solve real problems!

Using HDP

Here we go! We're going to walk you through a series of step-by-step tutorials to get you up and running with the Hortonworks Data Platform(HDP).

Downloading Example Data

We'll need some example data for our lessons. For our first lesson, we'll be using stock ticker data from the New York Stock Exchange from the years 2000-2001. You can download this file here:
The file is about 11 megabytes, and may take a few minutes to download. Fortunately, to learn 'Big Data' you don't have to use a massive dataset. You need only use tools that scale to massive datasets. Click and SAVEthis file to your computer.

Using the File Browser

You can reach the File Browser by clicking its icon:
Pick File<br />
                Browser
The File Browser interface should be familiar to you as it is similar to the file manager on a Windows PC or Mac. We begin in our home directory. This is where we'll store the results of our work. File Browser also lets us upload files.

Uploading a File

To upload the example data you just downloaded,
Select File<br />
                Upload
  • Select the 'Upload' button
  • Select 'Files' and a pop-up window will appear.
  • Click the button which says, 'Upload a file'.
  • Locate the example data file you downloaded and select it.
  • A progress meter will appear. The upload may take a few moments.
When it is complete you'll see this:
Uploaded Stock<br />
                Data
Now click the file name "NYSE-2000-2001.tar.gz". You'll see it, displayed in tabular form:
View Stock<br />
                Data
You can use File Browser just like your own computer's file manager. Next register the dataset with HCatalog.

Loading the sample data into HCatalog

Now that we've uploaded a file to HDFS, we will register it with HCatalog to be able to access it in both Pig and Hive.
Select the HCatalog icon in the icon bar at the top of the page:
Select<br />
                HCat
Select "Create a new table from file" from the Actions menu on the left.
HCat Create<br />
                Table
Fill in the Table Name field with 'nyse_stocks'. Then click on Choose a file button. Select the file we just uploaded 'NYSE-2000-2001.tsv.gz'.
HCat Choose<br />
                File
You will now see the options for importing your file into a table. The File options should be fine. In Table preview set all text type fields to Column Type 'string' and all decimal fields (ex: 12.55) to Column Type 'float.' The one exception is 'stock_volume' field should be set as 'bigint.' When everything is complete click on the "Create Table" button at the bottom.
HCat Define<br />
                Columns

A Short Apache Hive Tutorial

In the previous sections you:
  • Uploaded your data file into HDFS
  • Used Apache HCatalog to create a table
Apache Hive™ provides a data warehouse function to the Hadoop cluster. Through the use of HiveQL you can view your data as a table and create queries like you would in a database.
To make it easy to interact with Hive we use a tool in the Hortonworks Sandbox called Beeswax. Beeswax gives us an interactive interface to Hive. We can type in queries and have Hive evaluate them for us using a series of MapReduce jobs.
Let's open Beeswax. Click on the bee icon on the top bar.
Beeswax
On the right hand side there is a query window and an execute button. We will be typing our queries in the query window. When you are done with a query please click on the execute button. Note: There is a limitation of one query in the composition window. You can not type multiple queries separated by semicolons.
Since we created our table in HCatalog, Hive automatically knows about it. We can see the tables that Hive knows about by clicking on the Tables tab.
Tables
In the list of the tables you will see our table, nyse_stocks. Hive inherits the schema and location information from HCatalog. This separates meta information like schema and location from the queries. If we did not have HCatalog we would have to build the table by providing location and schema information.
We can see the records by typing Select * from nyse_stocks in the Query window. Our results would be:
Data<br />
                Table
We can see the columns in the table by executing describe nyse_stocks
NYSE
We will then get a description of the nyse table.
Describe NYSE<br />
                Table
We can count the records with the query select count(*) from nyse_stocks. You can click on the Beeswax icon to get back to the query screen. Evaluate the expression by typing it in the query window and hitting execute.
Select<br />
                Count
This job takes longer and you can watch the job running in the log. When the job is complete you will see the results posted in the Results tab.
Select Count<br />
                Results
You can select specific records by using a query like select * from nyse_stocks where stock_symbol="IBM".
Select<br />
                IBM
This will return the records with IBM.
Select IBM<br />
                Results
So we have seen how we can use Apache Hive to easily query our data in HDFS using the Apache Hive query language. We took full advantage of HCatalog so we did not have to specify our schema or location of the data. Apache Hive allows people who are knowledgable in query languages like SQL to immediately become PRODUCTIVE with Apache Hadoop. Once they know the schema of the data can they quickly and easily formulate queries.

Pig Basics Tutorial

In this tutorial we create and run Pig scripts. On the left is a list of scripts that we have created. In the middle is an area for us to compose our scripts. We will also load the data from the table we have stored in HCatalog. We will then filter out the records for the stock symbol IBM. Once we have done that we will calculate the average of closing stock prices over this period.
The basic steps will be:
  • Step 1: Create and name the script
  • Step 2: Loading the data
  • Step 3: Select all records starting with IBM
  • Step 4: iterate and average
  • Step 5: SAVE the script and execute it
Let's get started…
To get to the Pig interface click on the Pig icon on the icon bar at the top. This will bring up the Pig user interface. On the left is a list of your scripts and on the right is a composition box for your scripts.
A special feature of the interface is the Pig helper at the bottom. The Pig helper will provide us with templates for the statements, functions, I/O statements, HCatLoader() and Python user defined functions.
At the very bottom are status areas that will show the results of our script and log files
PIG<br />
                UI

Step 1: Create and name the script

  • Open the Pig interface by clicking the Pig icon at the top of the screen
    Image001-1
  • Title your script by filling in the title box
    Image001-2

Step 2: Loading the data

Our first line in the script will load the table. We are going to use HCatalog because this allows us to share schema across tools and users within our Hadoop environment. HCatalog allows us to factor out schema and location information from our queries and scripts and centralize them in a common repository. Since it is in HCatalog we can use the HCatLoader() function. Pig makes it easy by allowing us to give the table a name or alias and not have to worry about allocating space and defining the structure. We just have to worry about how we are processing the table.
  • On the right hand side we can start adding our code at Line 1
  • We can use the Pig helper at the bottom of the screen to give us a template for the line. Click on Pig helper -> HCatalog->load template
  • The entry %TABLE% is highlighted in red for us. Type the name of the table which is nyse_stocks.
  • Remember to add the a = before the template. This SAVES the results into a. Note the `= has to have a space before and after it.
Our completed line of code will look like:
a = LOAD ‘default.nyse_stocks’ USING org.apache.hcatalog.pig.HCatLoader()
Line<br />
                1
So now we have our table loaded into Pig and we stored it "a"

Step 3: Select all records starting with IBM

The next step is to select a subset of the records so that we just have the records for stock ticker of IBM. To do this in Pig we use the Filter operator. We tell Pig to Filter our table and keep all records where stock_symbol="IBM" and store this in b. With this one simple statement Pig will look at each record in the table and filter out all the ones that do not meet our criteria. The group statement is important because it groups the records by one or more relations. In this case we just specified all rather than specify the exact relation we need.
  • We can use Pig Help again by clicking on Pig helper->Relational Operators->FILTER template
  • We can replace %VAR% with "a" (hint: tab jumps you to the next field)
  • Our %COND% is "stock_symbol =='IBM'` " (note: single quotes are needed around IBM and don't forget the trailing semi-colon)
  • Pig helper -> Relational Operators->GROUP BY template
  • The first %VAR% is "b" and the second%VAR%is "all". You will need to correct an irregularity in the Pig syntax here. Remove the "BY`" in the line of code.
  • Again add the trailing semi-colon to the code.
So the final code will look like:
b = filter a by stock_symbol == 'IBM';
c = group b all;
Line<br />
                3
Now we have extracted all the records with IBM as the stock_symbol.

Step 4: Iterate and Average

Now that we have the right set of records we can iterate through them and create the average. We use the "foreach" operator on the grouped data to iterate through all the records. The AVG() function creates the average of the stock_volume field. To wind it up we just print out the results which will be a single floating point number. If our results would be used for a future job we can SAVE it back into a table.
  • Pig helper ->Relational Operators->FOREACH template will get us the code
  • Our first %VAR% is c and the second %VAR% is "AVG(b.stock_volume);"
  • We add the last line with Pig helper->I/O->DUMP template and replace %VAR% with "d".
Our last two lines of the script will look like:
d = foreach c generate AVG(b.stock_volume);
dump d;
Line<br />
                5
So the variable "d" will contain the average volume of IBM stock when this
line is executed.

Step 5: Save the script and Execute it

We can save our completed script using the Save button at the bottom and then we can Execute it. This will create a MapReduce job(s) and after it runs we will get our results. At the bottom there will be a progress bar that shows the job status.
  • At the bottom we click on the Save button again
  • Then we click on the Execute button to run the script
  • Below the Execute button is a progress bar that will show you how things are running.
  • When the job completes you will see the results in the green box.
  • Click on the Logs link to see what happened when your script ran. The average of stock_volume This is where you will see any error messages. The log may scroll below the edge of your window so you may have to scroll down.
Final

Summary

Now we have a complete script that computes the average volume of IBM stock. You can download the results by clicking on the green download icon above the green box.
Answer
If you look at what our script has done, you see in Line 5 we:
  • Pulled in the data from our table using HCatalog, we took advantage that HCatalog provided us with location and schema information, if that needs to change in the future we would not have to rewrite our script.
  • Pig then went through all the rows in the table and discarded the ones where the stock_symbol field is not IBM
  • Then an index was built for the remaining records
  • The average of stock_volume was calculated on the records
We did it with 5 lines of Pig script code!

Saturday, 6 September 2014

What is Hadoop...Apache Hadoop Terms/Abbreviations.

What is Hadoop

Need of Hadoop:


Big Data:

Today we are surrounded by data, infact it won't be wrong to say that we live in data age. The amount of data is increasing exponentially. As the data is increasing it is becoming more and more challenging for organisations to maintain and analyse this huge amount of data. The success of an organisation largely depends on their ability to extract valuable information from this huge amount of data. Hadoop uses the approach of scaling out rather than scaling up to DEAL with this exploding data i.e using more systems of computer rather than bigger computer systems.

Data Storage:

The access speeds of hard drives have not increased proportionally to their storage capacities over the years. As a result it takes hours to read an entire hard disk and even more time to perform write operations. However this problem can be solved by dividing the data over multiple hard drives and parallely reading the data from these hard drives. 

Parallel read and write operations raises new issues like
  1. Need to handle hardware failures: Hadoop has its own distributed filesystem called HDFS which DEALS with hardware failures by data replication. We'll learn more about HDFS in upcoming posts .
  2. Ability to combine data from different drives: Most of the analysis will require data from different  hard drives. Hadoop uses MapReduce programming model which abstracts this problem by tranforming it into computations over key and value pair. We'll learn this programming model in upcoming posts. For now all you need to know is that there are two phases of computation Mapping and Reducing. Mixing occurs at the interface between these two phases. 
Thus in short we can say that Hadoop provides us with two components HDFS and MapReduce that provides reliable shared storage and analysis system.

Hadoop Introduction:


Hadoop is a framework for implementing distributed computing to process big data.Some of the key features of Hadoop are

  1. Accessibility: Hadoop runs on large clusters of commodity hardware.
  2. Robustness: Hadoop handles failures by replication of data.
  3. Scalability: Hadoop scales up linearly.
  4. Simplicity: Hadoop allows users to write parallel programs quickly.

          The image below shows how users interacts with a Hadoop cluster.


          Client interaction with Hadoop cluster

          In my next post I'll show where Hadoop stands in terms of comparison with other systems.
          Apache Hadoop Terms/Abbreviations:-

          HDFS - Hadoop Distributed File System
          GFS - Google File System
          JSON - Java Script Object Notation
          NN - NameNode
          DN - Data Node
          SNN - Secondary NameNode
          JT - Job Tracker
          TT - Task Tracker
          HA NN - Highly Available NameNode (or NN HA - NameNode Highly Available)
          REST - Representational State Transfer
          HiveQL - Hive SQL
          CDH - Cloudera’s Distribution Including Apache Hadoop
          ZKFC - ZooKeeper Failover Controller
          FUSE - Filesystem In Userspace
          YARN - Yet Another Resource Negotiator
          Amazon EC2 - Amazon Elastic Compute Cloud
          Amazon S3 - Amazon Simple Storage Service