
Friday, 3 October 2014

How to Visualize Website Clickstream Data

Step 1: View and Refine the Website Clickstream Data in Hortonworks

In the “Loading Data into the Hortonworks Sandbox” tutorial, we loaded website data files into Hortonworks, and then used Hive queries to refine the data. Let’s take a closer look at that data.
Here’s a summary of the data we’re working with:
  • Omniture logs – website log files containing information such as URL, timestamp, IP address, geocoded IP address, and user ID (SWID).
  • users – CRM user data listing SWIDs (Software User IDs) along with date of birth and gender.
  • products – CMS data that maps product categories to website URLs.
Let’s start by looking at the raw Omniture website data. In the Hortonworks Sandbox, click the File Browser icon in the toolbar at the top of the page, then select the Omniture.0.tsv.gz file.
  • The raw data file appears in the File Browser, and you can see that it contains information such as URL, timestamp, IP address, geocoded IP address, and user ID (SWID).
    The Omniture log dataset contains about 4 million rows of data, which represents five days of clickstream data. Often, organizations will process weeks, months, or even years of data.
  • Now let’s look at the users table using HCatalog. Click the HCat icon in the toolbar at the top of the page, then click Browse Data in the users row.
  • The users table appears, and you can see the SWID, birth date, and gender columns.
  • You can also use HCatalog to view the data in the products table, which maps product categories to website URLs.
  • In the “Loading Data into the Hortonworks Sandbox” tutorial, we used Apache Hive to join the three data sets into one master set. This is easily accomplished in Hadoop, even when working with millions or billions of rows of data.
    First, we used a Hive script to generate an “omniture” view that contained a subset of the data in the Omniture log table.
  • To view the omniture data in Hive, click Table, then click Browse Data in the omniture row.
  • The query results will appear, and you can see that the results include the data from the Omniture log table that was specified in the query.
  • Next, we created a “webloganalytics” script to join the omniture website log data to the CRM data (registered users) and CMS data (products). This Hive query executed a join to create a unified dataset across our data sources; a sketch of what such a query might look like appears after this list.
  • You can view the data generated by the webloganalytics script in Hive as described in the preceding steps.
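For reference, the refinement described above can be expressed in HiveQL roughly as sketched below. This is a minimal sketch, not the exact script from the “Loading Data into the Hortonworks Sandbox” tutorial: the table and column names (omniture_raw, logdate, swid, birth_dt, gender_cd, and so on) are assumptions chosen for illustration.

-- Sketch only: table and column names are illustrative assumptions.
-- 1. Expose a useful subset of the raw Omniture log as a view.
CREATE VIEW omniture AS
SELECT logdate, url, ip, country, state, swid
FROM omniture_raw;

-- 2. Join the website log to the CRM (users) and CMS (products) data.
CREATE TABLE webloganalytics AS
SELECT o.logdate, o.url, o.ip, o.country, o.state,
       u.birth_dt, u.gender_cd AS gender,
       p.category
FROM omniture o
LEFT OUTER JOIN users u ON o.swid = u.swid
LEFT OUTER JOIN products p ON o.url = p.url;

The actual script presumably also derives fields such as the visitor's age from the birth date, but the idea is the same: one Hive join produces the single webloganalytics table that Excel reads in the next step.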
Now that we have reviewed the refined website data, we can access the data with Excel.

Step 2: Access the Website Clickstream Data with Excel

In this section, we will use Excel Professional Plus 2013 to access the refined clickstream data.
  • In Windows, open a new Excel workbook, then select Data > From Other Sources > From Microsoft Query.
  • On the Choose Data Source pop-up, select the Hortonworks ODBC data source you installed previously, then click OK. The Hortonworks ODBC driver enables you to access Hortonworks data with Excel and other Business Intelligence (BI) applications that support ODBC.
  • After the connection to the sandbox is established, the Query Wizard appears. Select the webloganalytics table in the Available tables and columns box, then click the right arrow button to add the entire webloganalytics table to the query. Click Next to continue.
  • On the Filter Data screen, click Next to continue without filtering the data.
  • On the Sort Order screen, click Next to continue without setting a sort order.
  • Click Finish on the Query Wizard Finish screen to retrieve the query data from the sandbox and import it into Excel.
  • On the Import Data dialog box, click OK to accept the default settings and import the data as a table.
  • The imported query data appears in the Excel workbook.
Now that we have successfully imported Hortonworks Sandbox data into Microsoft Excel, we can use the Excel Power View feature to analyze and visualize the data.

Step 3: Visualize the Website Clickstream Data Using Excel Power View

Data visualization can help you optimize your website and convert more visits into sales and revenue. In this section we will:
  • Analyze the clickstream data by location
  • Filter the data by product category
  • Graph the website user data by age and gender
  • Pick a target customer segment
  • Identify a few web pages with the highest bounce rates
In the Excel workbook with the imported webloganalytics data, select Insert > Power View to open a new Power View report.
  • The Power View Fields area appears on the right side of the window, with the data table displayed on the left. Drag the handles or click the Pop Out icon to maximize the size of the data table.
  • Let’s start by taking a look at the countries of origin of our website visitors. In the Power View Fields area, leave the country checkbox selected, and clear all of the other checkboxes. The data table will update to reflect the selections.
  • On the Design tab in the top menu, click Map.
  • The map view displays a global view of the data. Now let’s take a look at a count of IP addresses by state. First, drag the ip field into the SIZE box.
  • Drag country from the Power View Fields area into the Filters area, then select the usa checkbox.
  • Next, drag state into the LOCATIONS box. Remove the country field from the LOCATIONS box by clicking the down-arrow and then Remove Field.
  • Use the map controls to zoom in on the United States. Move the pointer over each state to display the IP count for that state.
  • Our dataset includes product data, so we can display the product categories viewed by website visitors in each state. To display product categories in the map by color, drag the category field into the COLOR box.
  • The map displays the product categories by color for each state. Move the pointer over each state to display detailed category information. We can see that the largest number of page hits in Florida was for clothing, followed by shoes.
  • Now let’s look at the clothing data by age and gender so we can optimize our content for these customers. Select Insert > Power View to open a new Power View report.
To set up the data, set the following fields and filters:
  • In the Power View Fields area, select ip and age. All of the other fields should be unselected.
  • Drag category from the Power View Fields area into the Filters area, then select the clothing checkbox.
  • Drag gender from the Power View Fields area into the Filters area, then select the M (male) checkbox.
After setting these fields and filters, select Column Chart > Clustered Column in the top menu.
  • To finish setting up the chart, drag age into the AXIS box. Also, remove ip from the AXIS box by clicking the down-arrow and then Remove Field. The chart shows that the majority of men shopping for clothing on our website are between the ages of 22 and 30. With this information, we can optimize our content for this market segment.
  • Let’s assume that our data includes information about website pages (URLs) with high bounce rates. A page is considered to have a high bounce rate if it is the last page a user visited before leaving the website. By filtering this URL data by our target age group, we can find out exactly which website pages we should optimize for this market segment. Select Insert > Power View to open a new Power View report.
To set up the data, set the following fields and filters:
  • Drag age from the Power View Fields area into the Filters area, then drag the sliders to set the age range from 22 to 30.
  • Drag gender from the Power View Fields area into the Filters area, then select the M (male) checkbox.
  • Drag country from the Power View Fields area into the Filters area, then select the usa checkbox.
  • In the Power View Fields area, select url. All of the other fields should be unselected.
  • In the Power View Fields area, move the pointer over url, click the down-arrow, and then select Add to Table as Count.
After setting these fields and filters, select Column Chart > Clustered Column in the top menu.
  • The chart shows that we should focus on optimizing four of our website pages for the market segment of men between the ages of 22 and 30. Now we can redesign these four pages and test the new designs based on our target demographic, thereby reducing the bounce rate and increasing customer retention and sales.
  • You can use the controls in the upper left corner of the chart to sort by Count of URL in ascending order.
Now that you have successfully analyzed and visualized Hortonworks Sandbox data with Microsoft Excel, you can see how Excel and other BI tools can be used with the Hortonworks platform to derive insights about customers from various data sources.
The data in the Hortonworks platform can be refreshed frequently and used for basket analysis, A/B testing, personalized product recommendations, and other sales optimization activities.

Monday, 22 September 2014

Hello World! – An introduction to Hadoop with Hive and Pig

Overview of Apache Hadoop and Hortonworks Data Platform

The Hortonworks Sandbox is a single-node implementation of the Hortonworks Data Platform (HDP). It is packaged as a virtual machine to make evaluation and experimentation with HDP fast and easy. The tutorials and features in the Sandbox are oriented towards exploring how HDP can help you solve your business's big data problems. The Sandbox tutorials will walk you through bringing some sample data into HDP and manipulating it using the tools built into HDP. The idea is to show you how to get started and how to accomplish tasks in HDP. HDP is free to download and use in your enterprise, and you can download it here: Hortonworks Data Platform
[Image: Yahoo Hadoop data nodes]
The Apache Hadoop projects provide a series of tools designed to solve big data problems. A Hadoop cluster implements a parallel computing cluster using inexpensive commodity hardware. The cluster is partitioned across many servers to provide near-linear scalability. The philosophy of the cluster design is to bring the computing to the data: each datanode holds part of the overall data and is able to process the data that it holds. The overall framework for the processing software is called MapReduce. Here's a short video introduction to MapReduce: Introduction to MapReduce
[Image: Yahoo MapReduce]
Apache Hadoop can be useful across a range of use cases spanning virtually every vertical industry. It is becoming popular anywhere that you need to store, process, and analyze large volumes of data. Examples include digital marketing automation, fraud detection and prevention, social network and relationship analysis, predictive modeling for new drugs, retail in-store behavior analysis, and mobile device location-based marketing.

The Hadoop Distributed File System

In this section we are going to take a closer look at some of the components we will be using in the Sandbox tutorials. Underlying all of these components is the Hadoop Distributed File System (HDFS™). This is the foundation of the Hadoop cluster. HDFS manages how the datasets are stored in the Hadoop cluster. It is responsible for distributing the data across the datanodes, managing replication for redundancy, and handling administrative tasks such as adding, removing, and recovering datanodes.

Apache Hive™

[Image: Hive UI overview]
The Apache Hive project provides a data warehouse view of the data in HDFS. Using a SQL-like language, Hive lets you create summarizations of your data, perform ad-hoc queries, and analyze large datasets in the Hadoop cluster. The overall approach with Hive is to project a table structure onto the dataset and then manipulate it with HiveQL. Since the data lives in HDFS, your operations can be scaled across all the datanodes and you can manipulate huge datasets.

Apache HCatalog

[Image: HCatalog UI overview]
The function of HCatalog is to hold location and metadata about the data in a Hadoop cluster. This allows scripts and MapReduce jobs to be decoupled from the data location and from metadata such as the schema. Additionally, since HCatalog supports many tools, like Hive and Pig, the location and metadata can be shared between them. Through HCatalog's open APIs, other tools such as Teradata Aster can also use the location and metadata it holds. In the tutorials we will see how we can reference data by name and inherit the location and metadata.
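As a concrete illustration, later in this blog the same nyse_stocks table is read by name from both Hive and Pig; neither statement has to spell out the file location or the schema, because HCatalog supplies both:

-- In Hive (Beeswax) the table is referenced by name alone:
select * from nyse_stocks;
-- In Pig the same table is loaded by name through HCatLoader():
a = LOAD 'default.nyse_stocks' USING org.apache.hcatalog.pig.HCatLoader();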

Apache Pig™

[Image: Pig UI overview]
Pig is a language for expressing data analysis and infrastructure processes. Pig scripts are translated into a series of MapReduce jobs that are run on the Hadoop cluster. Pig is extensible through user-defined functions that can be written in Java and other languages. Pig scripts provide a high-level language for creating the MapReduce jobs needed to process data in a Hadoop cluster.
That's all for now… let's get started with some examples of using these tools together to solve real problems!

Using HDP

Here we go! We're going to walk you through a series of step-by-step tutorials to get you up and running with the Hortonworks Data Platform (HDP).

Downloading Example Data

We'll need some example data for our lessons. For our first lesson, we'll be using stock ticker data from the New York Stock Exchange from the years 2000-2001. You can download this file here:
The file is about 11 megabytes and may take a few minutes to download. Fortunately, to learn 'Big Data' you don't have to use a massive dataset; you only need tools that scale to massive datasets. Click and save this file to your computer.

Using the File Browser

You can reach the File Browser by clicking its icon:
[Image: File Browser icon]
The File Browser interface should be familiar to you as it is similar to the file manager on a Windows PC or Mac. We begin in our home directory. This is where we'll store the results of our work. File Browser also lets us upload files.

Uploading a File

To upload the example data you just downloaded:
[Image: file upload menu]
  • Select the 'Upload' button
  • Select 'Files' and a pop-up window will appear.
  • Click the button which says, 'Upload a file'.
  • Locate the example data file you downloaded and select it.
  • A progress meter will appear. The upload may take a few moments.
When it is complete you'll see this:
[Image: uploaded stock data]
Now click the file name "NYSE-2000-2001.tsv.gz". You'll see it displayed in tabular form:
[Image: stock data preview]
You can use File Browser just like your own computer's file manager. Next, register the dataset with HCatalog.

Loading the sample data into HCatalog

Now that we've uploaded a file to HDFS, we will register it with HCatalog to be able to access it in both Pig and Hive.
Select the HCatalog icon in the icon bar at the top of the page:
[Image: HCatalog icon]
Select "Create a new table from file" from the Actions menu on the left.
[Image: HCatalog Create Table]
Fill in the Table Name field with 'nyse_stocks'. Then click the Choose a file button and select the file we just uploaded, 'NYSE-2000-2001.tsv.gz'.
[Image: HCatalog file chooser]
You will now see the options for importing your file into a table. The default File options should be fine. In the Table preview, set all text fields to Column Type 'string' and all decimal fields (e.g. 12.55) to Column Type 'float'. The one exception is the 'stock_volume' field, which should be set to 'bigint'. When everything is complete, click the "Create Table" button at the bottom.
[Image: HCatalog column definitions]
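The table this wizard creates is roughly equivalent to the HiveQL definition sketched below. Only the stock_symbol and stock_volume columns are named in this tutorial; the other column names are assumptions based on a typical NYSE daily-price file, so treat them as illustrative rather than exact.

-- Sketch only: most column names here are assumptions.
CREATE TABLE nyse_stocks (
  exchange_name         string,
  stock_symbol          string,
  trade_date            string,
  stock_price_open      float,
  stock_price_high      float,
  stock_price_low       float,
  stock_price_close     float,
  stock_volume          bigint,
  stock_price_adj_close float
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';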

A Short Apache Hive Tutorial

In the previous sections you:
  • Uploaded your data file into HDFS
  • Used Apache HCatalog to create a table
Apache Hive™ provides a data warehouse function to the Hadoop cluster. Through the use of HiveQL you can view your data as a table and create queries like you would in a database.
To make it easy to interact with Hive we use a tool in the Hortonworks Sandbox called Beeswax. Beeswax gives us an interactive interface to Hive. We can type in queries and have Hive evaluate them for us using a series of MapReduce jobs.
Let's open Beeswax. Click on the bee icon on the top bar.
[Image: Beeswax icon]
On the right hand side there is a query window and an Execute button. We will type our queries in the query window; when you are done with a query, click the Execute button. Note: there is a limitation of one query in the composition window. You cannot type multiple queries separated by semicolons.
Since we created our table in HCatalog, Hive automatically knows about it. We can see the tables that Hive knows about by clicking on the Tables tab.
[Image: Tables tab]
In the list of the tables you will see our table, nyse_stocks. Hive inherits the schema and location information from HCatalog. This separates meta information like schema and location from the queries. If we did not have HCatalog we would have to build the table by providing location and schema information.
We can see the records by typing select * from nyse_stocks in the query window. The results will look like this:
[Image: query results table]
We can see the columns in the table by executing describe nyse_stocks
We will then get a description of the nyse_stocks table.
[Image: describe nyse_stocks results]
We can count the records with the query select count(*) from nyse_stocks. You can click on the Beeswax icon to get back to the query screen. Evaluate the expression by typing it in the query window and hitting execute.
[Image: select count query]
This job takes longer and you can watch the job running in the log. When the job is complete you will see the results posted in the Results tab.
[Image: select count results]
You can select specific records by using a query like select * from nyse_stocks where stock_symbol="IBM".
[Image: select IBM query]
This will return the records with IBM.
[Image: select IBM results]
So we have seen how we can use Apache Hive to easily query our data in HDFS using the Hive query language. We took full advantage of HCatalog, so we did not have to specify the schema or location of the data. Apache Hive allows people who are knowledgeable in query languages like SQL to become productive with Apache Hadoop immediately. Once they know the schema of the data, they can quickly and easily formulate queries.

Pig Basics Tutorial

In this tutorial we create and run a Pig script using the Pig interface in the Sandbox: on the left is a list of the scripts we have created, and next to it is an area for composing new scripts. We will load the data from the table we stored in HCatalog, filter out the records for the stock symbol IBM, and then calculate the average volume of IBM stock over this period.
The basic steps will be:
  • Step 1: Create and name the script
  • Step 2: Loading the data
  • Step 3: Select all records starting with IBM
  • Step 4: Iterate and average
  • Step 5: Save the script and execute it
Let's get started…
To get to the Pig interface click on the Pig icon on the icon bar at the top. This will bring up the Pig user interface. On the left is a list of your scripts and on the right is a composition box for your scripts.
A special feature of the interface is the Pig helper at the bottom. The Pig helper will provide us with templates for the statements, functions, I/O statements, HCatLoader() and Python user defined functions.
At the very bottom are status areas that will show the results of our script and the log files.
[Image: Pig UI]

Step 1: Create and name the script

  • Open the Pig interface by clicking the Pig icon at the top of the screen
  • Title your script by filling in the title box

Step 2: Loading the data

Our first line in the script will load the table. We are going to use HCatalog because this allows us to share schema across tools and users within our Hadoop environment. HCatalog allows us to factor out schema and location information from our queries and scripts and centralize them in a common repository. Since it is in HCatalog we can use the HCatLoader() function. Pig makes it easy by allowing us to give the table a name or alias and not have to worry about allocating space and defining the structure. We just have to worry about how we are processing the table.
  • On the right hand side we can start adding our code at Line 1
  • We can use the Pig helper at the bottom of the screen to give us a template for the line. Click on Pig helper -> HCatalog->load template
  • The entry %TABLE% is highlighted in red for us. Type the name of the table which is nyse_stocks.
  • Remember to add the a = before the template. This saves the results into a. Note: the = has to have a space before and after it.
Our completed line of code will look like:
a = LOAD 'default.nyse_stocks' USING org.apache.hcatalog.pig.HCatLoader();
[Image: script line 1]
So now we have our table loaded into Pig, and we have stored it in "a".

Step 3: Select all records starting with IBM

The next step is to select a subset of the records, so that we just have the records for the IBM stock ticker. To do this in Pig we use the FILTER operator. We tell Pig to filter our table and keep all records where stock_symbol == 'IBM', and to store the result in b. With this one simple statement, Pig will look at each record in the table and filter out all the ones that do not meet our criteria. The GROUP statement is also needed: the AVG() function we use in the next step operates on a grouped relation, and here we simply group all of the filtered records together rather than grouping by a specific field.
  • We can use Pig Help again by clicking on Pig helper->Relational Operators->FILTER template
  • We can replace %VAR% with "a" (hint: tab jumps you to the next field)
  • Our %COND% is stock_symbol == 'IBM' (note: single quotes are needed around IBM, and don't forget the trailing semicolon)
  • Pig helper -> Relational Operators->GROUP BY template
  • The first %VAR% is "b" and the second %VAR% is "all". You will need to correct an irregularity in the generated line here: remove the "BY" so that the statement reads group b all.
  • Again add the trailing semi-colon to the code.
So the final code will look like:
b = filter a by stock_symbol == 'IBM';
c = group b all;
[Image: script line 3]
Now we have extracted all the records with IBM as the stock_symbol.

Step 4: Iterate and Average

Now that we have the right set of records, we can iterate through them and calculate the average. We use the FOREACH operator on the grouped data to iterate through all the records. The AVG() function calculates the average of the stock_volume field. To wrap it up, we just print out the result, which will be a single floating-point number. If the results were needed by a future job, we could save them back into a table.
  • Pig helper ->Relational Operators->FOREACH template will get us the code
  • Our first %VAR% is c and the second %VAR% is "AVG(b.stock_volume);"
  • We add the last line with Pig helper->I/O->DUMP template and replace %VAR% with "d".
Our last two lines of the script will look like:
d = foreach c generate AVG(b.stock_volume);
dump d;
[Image: script line 5]
So the variable "d" will contain the average volume of IBM stock when this
line is executed.

Step 5: Save the script and Execute it

We can save our completed script using the Save button at the bottom and then execute it. This will create one or more MapReduce jobs, and after they run we will get our results. At the bottom there will be a progress bar that shows the job status.
  • At the bottom we click on the Save button again
  • Then we click on the Execute button to run the script
  • Below the Execute button is a progress bar that will show you how things are running.
  • When the job completes you will see the results in the green box.
  • Click on the Logs link to see what happened when your script ran. This is where you will see any error messages. The log may scroll below the edge of your window, so you may have to scroll down.
[Image: final results]

Summary

Now we have a complete script that computes the average volume of IBM stock. You can download the results by clicking on the green download icon above the green box.
[Image: answer]
If you look at what our script has done, you can see that in just five lines we:
  • Pulled in the data from our table using HCatalog. Because HCatalog provided the location and schema information, we would not have to rewrite our script if either changed in the future.
  • Pig then went through all the rows in the table and discarded the ones where the stock_symbol field was not IBM.
  • The remaining records were then grouped together into a single relation.
  • The average of stock_volume was calculated over those records.
We did it with 5 lines of Pig script code!
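For reference, here is the complete script assembled from the steps above:

a = LOAD 'default.nyse_stocks' USING org.apache.hcatalog.pig.HCatLoader();
b = filter a by stock_symbol == 'IBM';
c = group b all;
d = foreach c generate AVG(b.stock_volume);
dump d;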

Saturday, 6 September 2014

What is Hadoop...Apache Hadoop Terms/Abbreviations.

What is Hadoop

The Need for Hadoop:


Big Data:

Today we are surrounded by data; in fact, it would not be wrong to say that we live in the data age. The amount of data is increasing exponentially, and as it grows it becomes more and more challenging for organisations to maintain and analyse it. The success of an organisation largely depends on its ability to extract valuable information from this huge amount of data. Hadoop takes the approach of scaling out rather than scaling up to deal with this exploding data, i.e. using more computer systems rather than bigger computer systems.

Data Storage:

The access speeds of hard drives have not increased proportionally to their storage capacities over the years. As a result, it takes hours to read an entire hard disk, and even more time to perform write operations. However, this problem can be solved by dividing the data over multiple hard drives and reading from them in parallel.

Parallel read and write operations raise new issues, such as:
  1. The need to handle hardware failures: Hadoop has its own distributed filesystem, called HDFS, which deals with hardware failures through data replication. We'll learn more about HDFS in upcoming posts.
  2. The ability to combine data from different drives: most analyses will require data from several hard drives. Hadoop uses the MapReduce programming model, which abstracts this problem by transforming it into computations over key-value pairs. We'll learn this programming model in upcoming posts; for now, all you need to know is that there are two phases of computation, mapping and reducing, and that the mixing of data from different drives occurs at the interface between these two phases.
Thus, in short, we can say that Hadoop provides two components, HDFS and MapReduce, which together give us a reliable shared storage and analysis system.

Hadoop Introduction:


Hadoop is a framework for implementing distributed computing to process big data. Some of the key features of Hadoop are:

  1. Accessibility: Hadoop runs on large clusters of commodity hardware.
  2. Robustness: Hadoop handles failures by replication of data.
  3. Scalability: Hadoop scales out linearly.
  4. Simplicity: Hadoop allows users to write parallel programs quickly.

          The image below shows how users interact with a Hadoop cluster.


          [Image: client interaction with a Hadoop cluster]

          In my next post I'll show how Hadoop compares with other systems.
          Apache Hadoop Terms/Abbreviations:

          HDFS - Hadoop Distributed File System
          GFS - Google File System
          JSON - JavaScript Object Notation
          NN - NameNode
          DN - DataNode
          SNN - Secondary NameNode
          JT - Job Tracker
          TT - Task Tracker
          HA NN - Highly Available NameNode (or NN HA - NameNode Highly Available)
          REST - Representational State Transfer
          HiveQL - Hive Query Language
          CDH - Cloudera’s Distribution Including Apache Hadoop
          ZKFC - ZooKeeper Failover Controller
          FUSE - Filesystem In Userspace
          YARN - Yet Another Resource Negotiator
          Amazon EC2 - Amazon Elastic Compute Cloud
          Amazon S3 - Amazon Simple Storage Service