
Tuesday 3 November 2015

9 MUST-HAVE SKILLS TO LAND TOP BIG DATA JOBS

1. Apache Hadoop


Sure, it’s entering its second decade now, but there’s no denying that Hadoop had a monstrous year in 2014 and is positioned for an even bigger 2015 as test clusters are moved into production and software vendors increasingly target the distributed storage and processing architecture. While the big data platform is powerful, Hadoop can be a fussy beast and requires care and feeding by proficient technicians. Those who know their way around the core components of the Hadoop stack–such as HDFS, MapReduce, Flume, Oozie, Hive, Pig, HBase, and YARN–will be in high demand.

2. Apache Spark


If Hadoop is a known quantity in the big data world, then Spark is a dark horse candidate that has the raw potential to eclipse its elephantine cousin. The rapidly rising in-memory stack is being proffered as a faster and simpler alternative to MapReduce-style analytics, either within a Hadoop framework or outside it. Best positioned as one of the components in a big data pipeline, Spark still requires technical expertise to program and run, thereby providing job opportunities for those in the know.

3. NoSQL


On the operational side of the big data house, distributed, scale-out NoSQL databases like MongoDB and Couchbase are taking over jobs previously handled by monolithic SQL databases like Oracle and IBM DB2. On the Web and with mobile apps, NoSQL databases are often the source of data crunched in Hadoop, as well as the destination for application changes put in place after insight is gleaned from Hadoop. In the world of big data, Hadoop and NoSQL occupy opposite sides of a virtuous cycle.

4. Machine Learning and Data Mining


People have been mining data for as long as they’ve been collecting it. But in today’s big data world, data mining has reached a whole new level. One of the hottest fields in big data last year was machine learning, which is poised for a breakout year in 2015. Big data pros who can harness machine learning technology to build and train predictive analytic apps such as classification, recommendation, and personalization systems are in super high demand, and can command top dollar in the job market.

5. Statistical and Quantitative Analysis


This is what big data is all about. If you have a background in quantitative reasoning and a degree in a field like mathematics or statistics, you’re already halfway there. Add in expertise with a statistical tool like R, SAS, Matlab, SPSS, or Stata, and you’ve got this category locked down. In the past, most quants went to work on Wall Street, but thanks to the big data boom, companies in all sorts of industries across the country are in need of geeks with quantitative backgrounds.

6. SQL


The data-centric language is more than 40 years old, but the old grandpa still has a lot of life left in today’s big data age. While it won’t be used for all big data challenges (see: NoSQL above), the simplicity of Structured Query Language makes it a no-brainer for many of them. And thanks to initiatives like Cloudera’s Impala, SQL is seeing new life as the lingua franca for the next generation of Hadoop-scale data warehouses.

7. Data Visualization


Big data can be tough to comprehend, but in some circumstances there’s no replacement for actually getting your eyeballs onto data. You can do multivariate or logistic regression analysis on your data until the cows come home, but sometimes exploring just a sample of your data in a tool like Tableau or QlikView can tell you the shape of your data, and even reveal hidden details that change how you proceed. And if you want to be a data artist when you grow up, being well-versed in one or more visualization tools is practically a requirement.

8. General Purpose Programming Languages


Having experience programming applications in general-purpose languages like Java, C, Python, or Scala could give you the edge over other candidates whose skill sets are confined to analytics. According to Wanted Analytics, there was a 337 percent increase in the number of job postings for “computer programmers” that required a background in data analytics. Those who are comfortable at the intersection of traditional app dev and emerging analytics will be able to write their own tickets and move freely between end-user companies and big data startups.

9. Creativity and Problem Solving


No matter how many advanced analytic tools and techniques you have under your belt, nothing can replace the ability to think your way through a situation. The implements of big data will inevitably evolve and new technologies will replace the ones listed here. But if you’re equipped with a natural desire to know and a bulldog-like determination to find solutions, then you’ll always have a job offer waiting somewhere.

Thursday 15 October 2015

M2M - Future of Code


Machine-to-Machine Technology



                How far can big data go? What is next for big data analytics? According to GCN, the next horizon for big data may be machine-to-machine (M2M) technology. As the coding of big data advances, Oracle now considers big data “an ecosystem of solutions” that will incorporate embedded devices to do real-time analysis of events and information coming in from the “Internet of Things,” according to the Dr. Dobbs website. A huge amount of data is being generated by all of the sensors and scanners we have today, and all of this data is useless unless taken in context with other sparse data. Each strand of data may only be a few kilobytes in size, but when put together with other sensor readings it can create a much fuller picture. Applications are needed not only to enable devices to talk with each other using M2M, but also to collect all the data and make sense of it.

                The future of sparse data could even include what some consider Thin Data. Thin data could include simple sensors and threshold monitors built into furniture and ancillary office equipment. Viewing all the sensors on a floor over time might show the impact of changing the temperature in the space, or of moving the coffee machine. You could look at the actual usage data of fixtures like doors and lavatories. There is a huge potential for inferential data mining. To take thin data to the next level, consider reproducing nanotechnology embedded in plant seeds. The nano agent would become part of the plant and relay state information as the plant grows, allowing massive crop harvesters to know if and when the plants are in distress. Other areas of interest for thin data include monitoring traffic on bridges and roadways, and a variety of weather monitors or tsunami prediction systems.

                Machina Research, a trade group for mobile device makers, predicts that within the next eight years, the number of connected devices using M2M will top 50 billion worldwide. The connected-device population will include everything from power and gas meters that automatically report usage data, to wearable heart monitors that automatically tell a doctor when a patient needs to come in for a checkup, to traffic monitors and cars that will by 2014 automatically report their position and condition to authorities in the event of an accident. One of the most popular M2M setups has been to create a central hub that can be reached by both wireless and wired signals. The sensors in the field record an event of significance, be it a temperature change, inventory leaving a specific area or even a door opening. The hub then sends that information to a central location where an operator might turn down the AC, order more toner cartridges or tell security about suspicious activity. The future model for M2M would eliminate the central hub and the human interaction: the devices would communicate with each other and work out the problems on their own. This smart technology would decrease the logistics downtime associated with replacing an ink cartridge on a printer. Once the toner reached a low threshold, the printer would send a request/requisition to the toner supplier and a replacement would immediately be shipped; once the toner was received, it could be replaced. This turnaround time would be drastically better than having the printer fail because of low toner, then ordering a cartridge, waiting on shipping, and finally replacing the toner.
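To make the toner example concrete, here is a rough, purely illustrative Java sketch of that device-to-supplier flow. The class, method and part names are invented for the illustration; a real device would poll an actual sensor and call a supplier's ordering system rather than print to the console.

    public class TonerMonitor {

        private static final int LOW_TONER_THRESHOLD = 10; // percent

        public static void main(String[] args) {
            int tonerLevel = 100;
            // Poll the (simulated) toner sensor until it crosses the threshold
            while (tonerLevel > LOW_TONER_THRESHOLD) {
                tonerLevel = readTonerSensor(tonerLevel);
            }
            // No central hub and no human operator: the device itself places the order
            sendReorderRequest("toner-cartridge-042");
        }

        // Stand-in for reading the printer's toner sensor
        private static int readTonerSensor(int currentLevel) {
            return currentLevel - 1; // simulate gradual depletion
        }

        // Stand-in for the machine-to-machine reorder message to the supplier's system
        private static void sendReorderRequest(String partNumber) {
            System.out.println("Reorder request sent for part " + partNumber);
        }
    }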

                Humans won’t be completely removed from the equation. They will still need to be in the chain to oversee the different processes, but they will be more of a second pair of eyes and less of a direct supervisor. Humans will let the machines do the work, and will only get involved when a machine reports a problem, like a communications failure. More application software development will be needed in the future to connect those 50 billion devices. Another place to learn more about M2M development is the Eclipse Foundation.

Wednesday 30 September 2015

So you support Digital India? Here's what you can do as a Startup.

So you support the Digital India initiative? Brilliant. Many have taken to Facebook, changing their profile pictures to the tri-color following Mark Zuckerberg's lead, but are unable to answer how they can contribute beyond that.
Here are the areas that you, as an entrepreneur, startup or technologist, should think about as places where you can make a difference and truly support this effort.
There are five (or more) spaces that need to come together collectively to make this work.
1. Devices
This is going to be capital intensive, but we need a ton of new devices. And devices don't just mean mobile phones and tablets (and phablets), but also devices like kiosks, sensors (IoT and otherwise) and several uni-functional devices (think about the devices that bus conductors use - isn't it about time we moved to a self-serve model? But that's another topic - or the devices that traffic cops and policemen use for lookups). It also means local-language and simplified user interfaces - so that even our grandparents could use them (quite literally so).
2. Access
This is where the deals being made with Google, Microsoft and others will come into play. This is where the net neutrality discussion is fiercely going on. How do we bring the cost of access down and make connectivity available everywhere at an affordable price? Telecom players will get in on this, and I am not too worried about this piece. India is already the cheapest telecom network in the world and data costs are "reasonable". If we provide adequate value and opportunities to earn by virtue of being connected, access costs shouldn't matter.
3. Content
We'll need content and content-creation infrastructure, not just in text but perhaps also in voice, IVR, TTS and voice recognition, built up in local languages. Video and audio sites, audiences, and infrastructure need to come up here.
4. Services
This includes servers, the stack that goes on top, and the set of government-related services that need to be built. Governments have always talked about a modular system, but the better way of building these systems is in an API/webhooks model: services talking to each other, interconnected, and easily upgradeable independently. Aadhar is a critical piece of this; in technology speak, Aadhar is the identity management service.
5. Security
Security will cut across all these layers. A digital India also means a digitally vulnerable India, and we do have the likes of China who will hack into networks. Security at the infrastructure level, and at the individual access level, is something to look at. Whoever designs this needs to keep in mind that India is democratic, so some philosophical / ideological design choices have to be made to ensure that security doesn't turn into censorship and that the systems of democracy continue to remain open.
That last one is a big one and perhaps the most overlooked. Security and access will be the two controversial spaces.

PS: There is perhaps a sixth piece. While I merged a couple of these pieces together above, there are some fundamental pieces missing. Thanks to startups like Reverie, we have rather elegant font engines for embedded devices.

But have you ever tried storing data in a regional language in a database? Storing the names of people, towns and places that are supposed to be in a regional language in English is half the reason why we have digital inefficiency - it is very hard to differentiate between two places or names that sound similar, or that are actually the same thing. We can avoid some of those problems if we can save and retrieve entries in regional languages. Some such building blocks (which should ideally be open source) are still missing. (by The Startup Guy.)
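As a small illustration of that last point, the hypothetical JDBC sketch below (the database name, table, credentials and even the choice of MySQL are made up for the example) stores and reads back a town name as Unicode rather than as an English transliteration. The harder, still-missing building blocks are the layers on top of this: transliteration, matching and search across such entries.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class RegionalNameDemo {
        public static void main(String[] args) throws Exception {
            // Assumes a local MySQL database "digital_india" with a UTF-8 character set
            // and a table: CREATE TABLE towns (name VARCHAR(100)) CHARACTER SET utf8mb4;
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/digital_india?useUnicode=true&characterEncoding=UTF-8",
                    "user", "password");

            // Store the town name in Devanagari instead of a lossy English transliteration
            PreparedStatement insert = conn.prepareStatement("INSERT INTO towns (name) VALUES (?)");
            insert.setString(1, "वाराणसी"); // Varanasi
            insert.executeUpdate();

            // Retrieve it back exactly as it was written
            PreparedStatement query = conn.prepareStatement("SELECT name FROM towns WHERE name = ?");
            query.setString(1, "वाराणसी");
            ResultSet rs = query.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getString("name"));
            }
            conn.close();
        }
    }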

Sunday 6 September 2015

13 big data and analytics companies to watch

Piling on
Investors are piling oodles of funds into startups focused on helping businesses quickly and cheaply sort through enormous data collections of both the structured and unstructured variety. Most of the newcomers, not surprisingly, have a cloud element to their offerings, leading to every sort of X-as-a-service pitch you can imagine. Here’s a look at some of the hottest big data and analytics companies (Note: I am only including those that have announced funding rounds this year).
Arcadia Data
Founded: 2012
Headquarters: San Mateo
Funding/investors: $11.5M in Series A funding in June, led by Mayfield.
Focus: Visual analytics and business intelligence for business users who need access to big data in enterprise Hadoop clusters without involving data scientists and other such experts. While optimized for Hadoop, customers can also use Arcadia’s technology for building browser-based apps across other data sources, including MySQL. A free download of the front-end visualization tool, Arcadia Instant, is available for Macs and Windows. Three of the co-founders come from Aster Data.
Cazena
Founded: 2014
Headquarters: Waltham, Mass.
Funding/investors: $28M, with a Series B round of $20M recently led by Formation 8.
Focus: Fast and inexpensive processing of big data in an encrypted cloud via what it calls enterprise Big Data-as-a-Service offerings broken down into Data Lake, Data Mart and Sandbox editions. The company, which emerged from stealth mode in July of 2015 and gets its name from the Hindi word for Treasure, is led mainly by former movers and shakers at Netezza, a data warehouse company acquired by IBM in 2010 for $1.7 billion.
DataHero
Founded: 2011
Headquarters: San Francisco
Funding/investors: $6.1M in Series A funding in May, led by Foundry Group.
Focus: Encrypted cloud-based business intelligence and analytics for end users, regardless of technical expertise. Self-service offering works with many popular services, including Salesforce.com, Dropbox, HubSpot and Google Analytics. Now led by CEO Ed Miller, a 25-year software industry veteran and entrepreneur.
DataTorrent
Founded: 2012
Headquarters: Santa Clara
Funding/investors: $23.8M, including a $15M Series B round in April led by Singtel Innov8.
Focus: Real-time big data analytics supported by an open source-based stream and batch processing engine that the company says can deal with billions of events per second in Hadoop clusters. Its co-founders both previously led engineering efforts at Yahoo (one at Yahoo Finance).
Enigma
Founded: 2011
Headquarters: New York City
Funding/investors: $32.7M, including a $28.2M Series B round announced in June and led by New Enterprise Associates.
Focus: Data discovery and analytics, both for enterprises that want better insight into their information and for the public, which can tap into a huge collection of public records. Enigma offers enterprises the Abstract data discovery platform and Signals analytics engine, which can be used to craft customized applications. Oh, and it uses one of those cool dot.io top-level domains for its URL.
Experfy
Founded: 2014
Headquarters: Harvard Innovation Lab in Boston
Funding/investors: $1.5M in seed funding, led by XPRIZE Chairman and CEO Peter Diamandis.
Focus: Cloud-based consulting marketplace designed to match up big data and analytics experts with clients who need their services. Experfy provides advisory services, big data readiness assessments, road maps, predictive dashboards, algorithms, and a number of custom analytics solutions for mid-market and Fortune 500s, according to co-CEO and Founder Harpreet Singh.
Interana
Founded: 2013
Headquarters: Menlo Park, Calif.
Funding/investors: $28.2M, including $20M in a Series B round in January led by Index Ventures.
Focus: Interactive analytics to answer business questions about how customers behave and products are used. A proprietary database enables its offering to deal with billions of events. The company was formed by a couple of former Facebook engineers and one of their wives, who serves as CEO.
JethroData
Founded: 2012
Headquarters: New York City
Funding/investors: $12.6M, including an $8.1M Series B round led by Square Peg Capital in June.
Focus: Index-based SQL engine for big data on Hadoop that enables speedy transactions with business intelligence offerings such as those from Qlik, Tableau and MicroStrategy. Co-founders come from Israel and have a strong track record at companies located in the United States. JethroData’s co-founders are Eli Singer, Boaz Raufman and Ronen Ovadya.
Neokami
Founded: 2014
Headquarters: Munich, Germany
Funding/investors: $1.1M in seed funding from angel investors.
Focus: “Revolutionising the Machine-Learning-as-a-Service space” that you probably didn’t even realize existed. This company exploits its expertise in artificial intelligence for CRM, security analytics and data science tools, the latter of which can be used to make sense of unstructured data via self-learning algorithms.
SlamData
Founded: 2014
Headquarters: Boulder
Funding/investors: $3.6M, led by True Ventures
Focus: Crafting the commercial version of the SlamData open-source project in an effort to help customers visualize semi-structured NoSQL data in a secure and manageable way. A key strength for SlamData is that it works natively with NoSQL databases like MongoDB. CEO Jeff Carr says SlamData is following in the footsteps of Tableau and Splunk, which have targeted structured and unstructured data, respectively.
Snowflake Computing
Founded: 2012
Headquarters: San Mateo
Funding/investors: $71M, with the most recent round of $45M led by Altimeter Capital.
Focus: This company doesn’t hide from the relatively old-fashioned term data warehouse, but puts a new twist on the technology by recreating it for the cloud and boasting of “the flexibility of big data platforms.” Led by former Microsoft and Juniper honcho Bob Muglia. Snowflake positions its offering as an elastic data warehouse.
Talena
Founded: 2011
Headquarters: Milpitas, Calif.
Funding/investors: $12M, including from Canaan Partners, announced in August.
Focus: Emerged from stealth mode in August with management software for big data environments, ensuring applications based on Hadoop, NoSQL and other platforms are available throughout their life cycle. Involves advances in data storage, recovery and compliance. CEO and Founder Nitan Donde led engineering efforts at Aster Data, EMC and other firms.
Tamr
Founded: 2013
Headquarters: Cambridge, Mass.
Funding/investors: $42.4M, with Series B funding of $25.2M from Hewlett-Packard Ventures and others announced in June.
Focus: Co-founder and database legend Michael Stonebraker says Tamr’s “scalable data-unification platform will be the next big thing in data and analytics - similar to how column-store databases were the next big thing in 2004.” In other words, Tamr uses machine learning and human input to enable customers to make use of data currently siloed in disparate databases, spreadsheets, logs and partner resources. Tamr’s tech got its start at MIT’s CSAIL.

Wednesday 19 August 2015

Integration of R with Java using Rserve

Introduction

Building machine-learning-based analytics applications requires a range of technologies. Java proves to be a great language for building enterprise solutions; however, Java falls short on the analytics front. To fill this gap we have languages like R, which has a rich set of machine learning and statistical libraries. By integrating these two technologies we can create high-end machine-learning-based applications. In the previous post, Integrate R with Java using rJava, I explained in detail what benefits we can achieve by integrating R with Java and which application architectures require this kind of integration.
There are two main packages to integrate R with Java:
  1. rJava
  2. Rserve
In the previous post we discussed the process of integrating R with Java using the rJava library. In this post we will discuss the differences between the rJava and Rserve packages, and then walk through the step-by-step process of integrating R with Java using the Rserve library.

Difference between Rserve and rJava packages

The main differences between rJava and Rserve can be discussed under the following headings:
  1. Operating in a server mode.
  2. Ease of use.
Operating in a server mode means whether the library runs as a server, to which a client program can connect to perform its tasks, or whether the library is used as an API that is called directly from inside the program and has no client-server nature. Based on this criterion, rJava is used as an API, i.e. it does not involve any client-server communication; the program using rJava calls it directly to execute R code. On the other hand, Rserve works in a client-server manner: you start an instance of the Rserve server, and clients communicate with it over TCP/IP (for more information, refer to the Rserve documentation).
Note: Basically, rJava provides low-level or system-level communication with R, while Rserve works over TCP/IP.
Ease of use is a subjective criterion, but in my experience Rserve is easier to work with than rJava. As we saw in the previous tutorial, configuring rJava means setting various paths and configuring various DLLs, whereas with Rserve you simply add the library to your R setup and use it directly from your Java code.
Now that we have compared the features of rJava and Rserve we will start with the technicalities of integrating R with Java using Rserve.

How to Integrate R with Java using Rserve package?

The components and their versions used for this tutorial are mentioned below:
  1. Operating System: Windows 7, 32 bit.
  2. JDK: Version 1.7 or above.
  3. Eclipse: Luna.
  4. R Workbench: This is the GUI used to run R scripts. We are using R 3.1.3; you could download the latest version from the same link.

Configuring R

Simply install the R workbench downloaded above. Try installing the workbench at a location other than the C:\ drive, as that drive can have permission issues. For the purpose of this tutorial the R workbench is installed in the D:\ProgramFiles\R directory.

Integration Steps

Step-1 (Installing Rserve package)
Open your R workbench and enter the command install.packages("Rserve") on your R console.
A window with the header CRAN mirror will open, asking you to select the mirror from which you want to install the package. For the purpose of this tutorial we have chosen USA (KS); you could select any other mirror as well. After selecting the mirror, press OK. R will start installing the package, and after the package is installed your R console will show the following message.
Fig. 1
Step-2 (Starting Rserve server)
Once you have installed the Rserve package, you need to start the server. To start the server you first have to load the package into your current R session: type library(Rserve) on your R console to import the Rserve package.
Then type Rserve() to start the Rserve server on the default port 6311.
Your console will look something like:
Fig. 2
That is all you need to run an instance of Rserve server.
Step-3 (Creating JAVA Client)
Now that you have the Rserve server running, you need a Java program that communicates with R using Rserve and uses R functionality inside Java code. We will be creating the program in Eclipse as follows:
  1. Open Eclipse Luna.
  2. Create a Java project named RserveProject.
  3. Rserve provides some client jars that are used inside the Java program to communicate with R. These jar files are included in the Rserve package that you installed from the R console.
  4. For my installation the jar files are located at D:\ProgramFiles\R\R-3.1.3\library\Rserve\java\; however, if you installed your R setup somewhere else, the path to these files will be <YOUR_R_HOME>\library\Rserve\java\. The main jar files needed are REngine.jar and Rserve.jar.
  5. You need to include these two jars in your Eclipse project. In the Package Explorer section, right-click on the project and select Build Path > Configure Build Path (Fig. 3).
  6. In the window titled Properties for RserveProject, select the Libraries tab (Fig. 4).
  7. Now select the Add External JARs button in the right panel. Browse to the location <YOUR_R_HOME>\library\Rserve\java\, select the files REngine.jar and Rserve.jar, then click the Open button on the current window and the OK button on the next window.
  8. Now the structure of your Eclipse project RserveProject will look similar to Fig. 5.
  9. Now create a package named pkg under the src folder of RserveProject and create a class Temp.java under pkg.
That is all that is needed for the Java configuration. Now we need to write the Java code.
Step-4 (JAVA client for Rserve)
For the Java code we will be using a use-case where we have an R vector c(1,2,3,4) and we want to compute its mean using R. The Java code for the use-case is sketched below.
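This is a minimal sketch of what Temp.java could look like, assuming the Rserve server started above is listening locally on its default port 6311:

    package pkg;

    import org.rosuda.REngine.REXP;
    import org.rosuda.REngine.Rserve.RConnection;

    public class Temp {
        public static void main(String[] args) {
            try {
                // Connect to the Rserve instance listening on localhost:6311
                RConnection connection = new RConnection();
                // Ask R to compute the mean of the vector c(1,2,3,4)
                REXP result = connection.eval("mean(c(1,2,3,4))");
                System.out.println("Mean of the vector is: " + result.asDouble());
                connection.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }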
Step-5 (Output of Java program)
Since the R vector is c(1,2,3,4), its mean should be (1+2+3+4)/4 = 10/4 = 2.5.

Calling User-defined R functions in Java

The above program shows how to use built-in R functions from Java. However, you may face a situation where you have some user-defined functions in an R script and want to use those functions from Java code. Let's say we need a custom myAdd() function that adds two integers. To solve this use case, proceed as follows:
Step-1 (Create a R script)
Open a text editor and paste a small function definition such as: myAdd <- function(x, y) { return(x + y) }
Here we are just defining a function myAdd() that takes two parameters x and y and returns their sum. Save this file as MyScript.R on your disk (we have used the D:\MyScript.R location).
Step-2 (Java program to call external R script)
Now create a Java program as you created above and use code along the following lines.
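A minimal sketch of such a program (the class name and the argument values 10 and 20 are chosen for illustration, consistent with the result mentioned below):

    package pkg;

    import org.rosuda.REngine.REXP;
    import org.rosuda.REngine.Rserve.RConnection;

    public class CustomFunctionClient {
        public static void main(String[] args) {
            try {
                RConnection connection = new RConnection();
                // Load the user-defined functions from the external R script
                connection.eval("source(\"D:\\\\MyScript.R\")");
                // Call the user-defined myAdd() function: 10 + 20 should give 30
                REXP result = connection.eval("myAdd(10, 20)");
                System.out.println("Sum computed by R: " + result.asInteger());
                connection.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }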
Here you first source the code written in D:\MyScript.R into your Rserve session, and then call the user-defined function myAdd(). Running this code should return the result 30.
A note on slashes in the path: As you can see, we have used four slashes (\\\\) in the above path. In R, if you use a back slash (\) in the source() command, you have to escape it with another \, so the actual R command is source("D:\\MyScript.R").
Now, since this command is passed as a String in the Java code, in Java you have to escape each slash (\) with another slash (\). The java.lang.String form of the above R command is "source(\"D:\\\\MyScript.R\")".
However, if you use a forward slash (/) in your path then there is no difference between the R and Java syntax. The R command then looks like source("D:/MyScript.R"), and the Java version is similar: "source(\"D:/MyScript.R\")".

A note on the multi-threaded nature of Rserve

As the Rserve library runs in the form of a server, it can handle multiple requests simultaneously. By this we mean that when we start an instance of Rserve with the Rserve() command, that one instance should handle multiple requests sent by different invocations of new RConnection() from the Java code. Rserve is capable of handling multiple requests simultaneously by creating a separate process for each request using the fork() system call.
Linux environment
On a Linux environment you can simply launch a single instance of Rserve using the Rserve() command and then use multiple new RConnection() calls to create concurrent connections to Rserve (as the fork() facility is available on Linux).
Windows environment
As fork() is not present on Windows, you cannot handle multiple simultaneous requests using the above commands (although it works fine for one request at a time). There is a workaround for this situation. Suppose you have a Java application that creates 3 threads and all of these threads create a connection to R using new RConnection(); this scenario will not work on Windows because Windows cannot fork a separate process for each call. To overcome this, start 3 instances of Rserve from the R console on different ports, for example Rserve(port=6311), Rserve(port=6312) and Rserve(port=6313).
Now that you have 3 separate instances, your 3 threads can easily connect to these 3 instances:
Thread 1: new RConnection("127.0.0.1", 6311)
Thread 2: new RConnection("127.0.0.1", 6312)
Thread 3: new RConnection("127.0.0.1", 6313)
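Putting this together, a small illustrative client might look like the sketch below; the class name and the ports 6311-6313 are assumptions made for the example, and each thread talks to its own Rserve instance.

    package pkg;

    import org.rosuda.REngine.Rserve.RConnection;

    public class MultiInstanceClient {

        // Build a thread that connects to the Rserve instance on the given port
        private static Thread worker(final int port) {
            return new Thread(new Runnable() {
                public void run() {
                    try {
                        RConnection c = new RConnection("127.0.0.1", port);
                        double mean = c.eval("mean(c(1,2,3,4))").asDouble();
                        System.out.println("Rserve on port " + port + " returned " + mean);
                        c.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }

        public static void main(String[] args) {
            // One thread per Rserve instance started from the R console
            worker(6311).start();
            worker(6312).start();
            worker(6313).start();
        }
    }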