
Saturday, 16 April 2016

Development and deployment of Spark applications with Scala, Eclipse, and sbt – Installation & configuration

The purpose of this tutorial is to set up the necessary environment for the development and deployment of Spark applications with Scala. Specifically, we are going to use the Eclipse IDE to develop applications and deploy them with spark-submit. The glue that ties everything together is the sbt interactive build tool. The sbt tool provides plugins used to:
  1. Create an Eclipse Scala project with Spark dependencies
  2. Create a jar assembly with all necessary dependencies so that it can be deployed and launched using spark-submit
The steps presented assume just a basic Linux installation with Java SE Development Kit 7. We are going to download, install, and configure the following software components:
  1. The latest sbt build tool
  2. Scala IDE for Eclipse
  3. Spark 1.4.1

Installation instructions

Installing sbt
sbt download and installation is straightforward, as shown in the commands below:
~$ wget https://dl.bintray.com/sbt/native-packages/sbt/0.13.8/sbt-0.13.8.tgz
~$ gunzip sbt-0.13.8.tgz
~$ tar -xvf sbt-0.13.8.tar
~$ export PATH=$PATH:~/sbt/bin
The last command adds the sbt executable into the PATH shell variable. Now we can call sbt from any directory to create and package our projects. The first time it runs it will need to fetch some data over the internet, so be patient!
We are not quite done with sbt yet. We need to install two very important plugins.
sbteclipse plugin
sbteclipse is the sbt plugin for creating Eclipse project definitions.
Add sbteclipse to your plugin definition file (or create one if it doesn't exist). You can use either:
  • the global file (for version 0.13 and up) at ~/.sbt/0.13/plugins/plugins.sbt
  • the project-specific file at PROJECT_DIR/project/plugins.sbt
For the latest version add the following line in plugins.sbt:
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0") 
as shown below (use ^D to end the cat command):
~$ mkdir -p ~/.sbt/0.13/plugins # mkdir -p creates all missing directories in the path
~$ cat >> ~/.sbt/0.13/plugins/plugins.sbt
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")
<ctrl>+D
~$
After installation, the next time we launch sbt we will be able to use the additional command eclipse.
sbt-assembly plugin
sbt-assembly is an sbt plugin that creates a fat JAR of your project with all of its dependencies included. According to the Spark documentation, if your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. This is why we need the sbt-assembly plugin. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled, since they are provided by the cluster manager at runtime. Once you have an assembled jar, you can call the bin/spark-submit script, as shown later in this post, passing your jar.
~$ mkdir -p ~/.sbt/0.13/plugins
~$ cat >> ~/.sbt/0.13/plugins/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
<ctrl>+D
~$
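To illustrate the provided scope mentioned above, a build definition intended for an sbt-assembly build might declare the Spark dependency as follows. This is only a minimal sketch mirroring the sample.sbt used later in this post; the sample project itself sticks with sbt package and the default scope.
// Hypothetical build.sbt for an assembly build: Spark is marked "provided",
// so sbt-assembly leaves it out of the fat jar and the cluster supplies it at runtime.
name := "Sample Project"

version := "1.0"

scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1" % "provided"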
Installing Scala IDE for Eclipse
Downloading and installing the Scala IDE for Eclipse is also straightforward:
~$ wget http://downloads.typesafe.com/scalaide-pack/4.1.1-vfinal-luna-211-20150728/scala-SDK-4.1.1-vfinal-2.11-linux.gtk.x86_64.tar.gz
~$ gunzip scala-SDK-4.1.1-vfinal-2.11-linux.gtk.x86_64.tar.gz
~$ tar -xvf scala-SDK-4.1.1-vfinal-2.11-linux.gtk.x86_64.tar
~$ ~/eclipse/eclipse # this runs Eclipse IDE
As you can see from the figure below, a new menu item named Scala is added in the classic Eclipse menu bar:
post_eclipse_ide
Installing Spark 1.4.1 (this may take a while)
Instructions for downloading and building Spark are provided in the Spark documentation. There are several options available; since Spark is packaged with a self-contained Maven installation (located under the build/ directory) to ease building and deployment from source, we choose this option. Notice that we build Spark with the latest Scala 2.11 (included in the Scala IDE for Eclipse we downloaded in the previous step):
~$ wget http://www.apache.org/dyn/closer.cgi/spark/spark-1.4.1/spark-1.4.1.tgz
~$ gunzip spark-1.4.1.tgz
~$ tar -xvf spark-1.4.1.tar
~$ cd spark-1.4.1/
~spark-1.4.1/$ build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Dscala-2.11 -DskipTests clean package
...
...
~spark-1.4.1/$ export PATH=$PATH:~/spark-1.4.1/bin # make all Spark binaries accessible
As with sbt above, we include the last command so as to make Spark binaries accessible from everywhere.
Having installed all the necessary components, we now proceed to demonstrate the creation of a simple application.

Creating a sample application (sbt package)

The task now is to create a self-contained Scala/Spark application using sbt and the Eclipse IDE.
Creating sample sbt project
For this demonstration, we will create a very simple Spark application in Scala named SampleApp (creating a realistic application will be covered in a follow-up post). First we prepare the directory structure:
~$ mkdir SampleApp
~$ cd SampleApp
~/SampleApp$ mkdir -p src/main/scala # mandatory structure
In the directory ~/SampleApp/src/main/scala we create the following Scala file SampleApp.scala (using just a text editor for now):
/* SampleApp.scala:
   This application simply counts the number of lines that contain "val" from itself
 */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
 
object SampleApp {
  def main(args: Array[String]) {
    val txtFile = "/home/osboxes/SampleApp/src/main/scala/SampleApp.scala"
    val conf = new SparkConf().setAppName("Sample Application")
    val sc = new SparkContext(conf)
    val txtFileLines = sc.textFile(txtFile, 2).cache()
    val numAs = txtFileLines.filter(line => line.contains("val")).count()
    println("Lines with val: %s".format(numAs))
  }
}
In the directory ~/SampleApp we create a configuration file sample.sbt containing the following:
name := "Sample Project"
 
version := "1.0"
 
scalaVersion := "2.11.7"
 
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1"
The resulting directory structure should be as shown below:
osboxes@osboxes:~/SampleApp$ find .
.
./sample.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SampleApp.scala
Use sbt to package and launch the sample application
We are now ready to package everything into a single jar and deploy using spark-submit. Notice that the sbt tool creates a hidden directory in the home folder ~/.ivy2/ that contains all cached jars used for packaging the application.
~/SampleApp$ sbt package
...
[info] Loading global plugins from /home/osboxes/.sbt/0.13/plugins
[info] Set current project to Sample Project (in build file:/home/osboxes/SampleApp/)
...
...
[info] Compiling 1 Scala source to /home/osboxes/SampleApp/target/scala-2.11/classes...
[info] Packaging /home/osboxes/SampleApp/target/scala-2.11/sample-project_2.11-1.0.jar ...
[info] Done packaging.
[success] Total time: 15 s, completed 30-Jul-2015 18:55:17
~/SampleApp$
Notice that the result of the packaging is the file sample-project_2.11-1.0.jar. This is deployed as follows:
~/SampleApp$ spark-submit --class "SampleApp" --master local[2] target/scala-2.11/sample-project_2.11-1.0.jar
...
...
Lines with val: 6
~/SampleApp$
We can easily verify that the number of lines in our simple script containing "val" is indeed six (five val assignments plus one occurrence in the println argument).
Use sbt to create an Eclipse project
In order to create an Eclipse project for this sample application, we issue the following sbt command:
~/SampleApp$ sbt eclipse # this command was added by the sbteclipse plugin
[info] Loading global plugins from /home/osboxes/.sbt/0.13/plugins
[info] Set current project to Sample Project (in build file:/home/osboxes/SampleApp/)
[info] About to create Eclipse project files for your project(s).
[info] Successfully created Eclipse project files for project(s):
[info] Sample Project
~/SampleApp$
Now the Eclipse project is created inside the ~/SampleApp directory. We use Eclipse to import an existing project:
post_eclipse_import
Select Browse to search for the ~/SampleApp directory.
post_eclipse_import_select
Do not check the option Copy projects into workspace
post_eclipse_import_finish
The result is the complete project tree in the Package Explorer of Eclipse. All Spark and Hadoop related dependencies have been automatically imported from sbt. Now you can edit SampleApp.scala directly from Eclipse using code completion, syntax highlighting and more.
post_eclipse_import_done
Run the sample application from Eclipse
Source code editing using Eclipse can be real fun! Code completion, refactoring, smart indenter, code formatting, syntax highlighting – you name it, Eclipse provides it! But what about running the application? We can do that too, with a little configuration and a minor addition in the Scala source code.
From the Eclipse menu bar select Run -> Run Configurations. On the left panel right click on Scala Application and select New. This opens the Create, manage, and run configurations window:
post_eclipse_run_config
Enter the name of the class we want to run - in this case it is SampleApp. Then press Apply and the run configuration is ready to go. The last step is to modify the source code to reflect the Spark runtime configuration. In this example it suffices to set the master URL to "local[2]", which runs the application locally using two worker threads.
val conf = new SparkConf().setAppName("Sample Application").setMaster("local[2]")
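A slight variation (a sketch, assuming the SparkConf API of Spark 1.x) avoids hard-coding the master, so the same source still runs unchanged when the jar is later deployed with spark-submit --master:
// Hypothetical alternative: fall back to local[2] only when no master
// has been supplied externally (e.g. via spark-submit --master).
val conf = new SparkConf()
  .setAppName("Sample Application")
  .setIfMissing("spark.master", "local[2]")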
Now we are ready to launch the application from Eclipse by selecting Run->Sample Application:
post_eclipse_run_example
From this point onwards, we can use the Eclipse IDE to further develop our application and run test instances during the process. When we are confident with our code, we can switch to sbt packaging/deployment and run our application on systems with a Spark 1.4.1 installation. The development cycle can be as follows:
  1. Use Eclipse to modify the project and test it
  2. Use sbt package to create the final jar
  3. Deploy using spark-submit
  4. Go to step 1, if necessary, and refine further

Tuesday, 22 March 2016

SBT run: choose automatically the App to launch

I like Holi because, besides resting from work and having fun with family and friends, there is usually time to learn something new. During this Holi I've been playing with Scala: first, trying to finish the Coursera Functional Programming Principles course; later, working a bit on a personal project. Better late than never :)
As for my personal project, it provides more than one executable entry point.
Working with Scala and SBT, the command sbt run looks like the natural alternative. It looks for every Scala object in the project that could be used as the application entry point:
  • an object that defines a main method
  • an object that inherits from App
If your application has more than one object meeting these requirements, the command sbt run will ask for your help to finish the execution.
Let's consider the following snippet of code, having two objects that define a main method:
// File src/main/scala/Foo.scala
object Foo {
    def main(args: Array[String]) = println("Hello from Foo")
}

// File src/main/scala/Bar.scala
object Bar extends App {
    println("Hello from Bar")
}
When you execute the sbt run command, the following text shows up:
> sbt run

Multiple main classes detected, select one to run:

 [1] Bar
 [2] Foo

Enter number: 2
[info] Running Foo
Hello from Foo
[success] Total time: 29 s, completed Dec 30, 2012 11:36:28 PM
It requires human action (in the previous example, typing the number 2), as the run command does not accept a parameter to automate the process.
Fortunately, there's an easy solution using the SBT plugin sbt-start-script. You just need to follow these three steps:
  • Create (or update) the file project/plugins.sbt, including:
addSbtPlugin("com.typesafe.sbt" % "sbt-start-script" % "0.6.0")
  • Create (or update) the file build.sbt, adding:
import com.typesafe.sbt.SbtStartScript
seq(SbtStartScript.startScriptForClassesSettings: _*)
  • Execute:
sbt update
sbt start-script
As a result, a new file, target/start, is created. The script takes the name of the main class to execute as its first argument:
> target/start Foo
Hello from Foo

> target/start Bar
Hello from Bar
Two last tips:
  • If your program has just a single main class, the script does not require any argument.
  • Remember to add the automatically generated file target/start to your VCS.
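As a final aside, if one entry point should always win, sbt itself can be told which main class run should launch through build.sbt (a minimal sketch, assuming the Foo object from the example above):
// Hypothetical build.sbt setting: make "sbt run" launch Foo without prompting.
mainClass in (Compile, run) := Some("Foo")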

Wednesday, 3 February 2016

Mosquitos and Average Temperature plot using R

#setwd("../working")
library(data.table)
library(ggplot2)
library(lubridate)
dataFolder = "../input"
dtTrain = fread(file.path(dataFolder,"train.csv"))
weather = fread(file.path(dataFolder, "weather.csv"))

dtTrain[,Date:=as.Date(Date)]
dtTrain[,':=' (year=year(Date), dayOfYear=yday(Date))]

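# Tavg is recorded as 'M' when missing; mark those rows with -1, then
# fill them with the mean of Tmax and Tmin.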
weather[Tavg=='M', Tavg:='-1']
weather[,':=' (Date=as.Date(Date),Tmax=as.integer(Tmax),Tmin=as.integer(Tmin),Tavg=as.integer(Tavg))]
weather[, ':=' (year=year(Date), dayOfYear=yday(Date))]
weather[Tavg==-1,Tavg:=as.integer((Tmax+Tmin)/2)]

mosquitosStats<-dtTrain[,.(dayOfYear,year,NumMosquitos.sum=sum(NumMosquitos)),by=Date]
#Total mosquitos by Date
log_scale_mosquitos<-ggplot(mosquitosStats)+geom_point(aes(dayOfYear, log(NumMosquitos.sum),color=NumMosquitos.sum))+
  facet_grid(year ~ .)+
  scale_color_gradient(low="blue", high="Red")+
  ggtitle("Total mosquitos by Day")
ggsave("log_scale_mosquitos.png", log_scale_mosquitos)

mosquitosSum<-dtTrain[,.(NumMosquitos.sum=sum(NumMosquitos)),by=Date]

weatherTrain<-weather[Station==1&year%%2==1,]
mosquitosByDate<-merge(weatherTrain,mosquitosSum,by="Date",all.x=TRUE)
mosquitosByDate[is.na(NumMosquitos.sum), NumMosquitos.sum:=0]
#Put mosquito count and temperature into one plot; scale mosquito count and Tavg to fit a similar range
mosquitos_temperature_plot <- ggplot(mosquitosByDate)+geom_point(aes(dayOfYear, Tavg/10,color=Tavg))+
  geom_line(aes(dayOfYear, log(NumMosquitos.sum)),color="olivedrab")+
  facet_grid(year ~ .)+
  scale_color_gradient(low="blue", high="Red")+
  ggtitle("log(NumMosquitos.sum) and Tavg plot")
ggsave("mosquitos_temperature_plot.png",mosquitos_temperature_plot)


                
This script has been released under the Apache 2.0 open source license.



Loading required package: methods

Attaching package: ‘lubridate’

The following objects are masked from ‘package:data.table’:

    hour, mday, month, quarter, wday, week, yday, year

Saving 12.5 x 6.67 in image
Saving 12.5 x 6.67 in image

Tuesday, 3 November 2015

9 MUST-HAVE SKILLS TO LAND TOP BIG DATA JOBS

1. Apache Hadoop


Sure, it's entering its second decade now, but there's no denying that Hadoop had a monstrous year in 2014 and is positioned for an even bigger 2015 as test clusters are moved into production and software vendors increasingly target the distributed storage and processing architecture. While the big data platform is powerful, Hadoop can be a fussy beast and requires care and feeding by proficient technicians. Those who know their way around the core components of the Hadoop stack - such as HDFS, MapReduce, Flume, Oozie, Hive, Pig, HBase, and YARN - will be in high demand.

2. Apache Spark


If Hadoop is a known quantity in the big data world, then Spark is a dark horse candidate that has the raw potential to eclipse its elephantine cousin. The rapid rise of the in-memory stack is being proffered as a faster and simpler alternative to MapReduce-style analytics, either within a Hadoop framework or outside it. Best positioned as one of the components in a big data pipeline, Spark still requires technical expertise to program and run, thereby providing job opportunities for those in the know.

3. NoSQL


On the operational side of the big data house, distributed, scale-out NoSQL databases like MongoDB and Couchbase are taking over jobs previously handled by monolithic SQL databases like Oracle and IBM DB2. On the Web and with mobile apps, NoSQL databases are often the source of data crunched in Hadoop, as well as the destination for application changes put in place after insight is gleaned from Hadoop. In the world of big data, Hadoop and NoSQL occupy opposite sides of a virtuous cycle.

4. Machine Learning and Data Mining


People have been mining for data as long as they’ve been collecting it. But in today’s big data world, data mining has reached a whole new level. One of the hottest fields in big data last year is machine learning, which is poised for a breakout year in 2015. Big data pros who can harness machine learning technology to build and train predictive analytic apps such as classification, recommendation, and personalization systems are in super high demand, and can command top dollar in the job market.

5. Statistical and Quantitative Analysis


This is what big data is all about. If you have a background in quantitative reasoning and a degree in a field like mathematics or statistics, you’re already halfway there. Add in expertise with a statistical tool like R, SAS, Matlab, SPSS, or Stata, and you’ve got this category locked down. In the past, most quants went to work on Wall Street, but thanks to the big data boom, companies in all sorts of industries across the country are in need of geeks with quantitative backgrounds.

6. SQL


The data-centric language is more than 40 years old, but the old grandpa still has a lot of life left in today's big data age. While it won't be used with all big data challenges (see: NoSQL above), the simplicity of Structured Query Language makes it a no-brainer for many of them. And thanks to initiatives like Cloudera's Impala, SQL is seeing new life as the lingua franca for the next generation of Hadoop-scale data warehouses.

7. Data Visualization


Big data can be tough to comprehend, but in some circumstances there's no replacement for actually getting your eyeballs onto data. You can do multivariate or logistic regression analysis on your data until the cows come home, but sometimes exploring just a sample of your data in a tool like Tableau or Qlikview can tell you the shape of your data, and even reveal hidden details that change how you proceed. And if you want to be a data artist when you grow up, being well-versed in one or more visualization tools is practically a requirement.

8. General Purpose Programming Languages


Having experience programming applications in general-purpose languages like Java, C, Python, or Scala could give you the edge over other candidates whose skill sets are confined to analytics. According to Wanted Analytics, there was a 337 percent increase in the number of job postings for "computer programmers" that required a background in data analytics. Those who are comfortable at the intersection of traditional app dev and emerging analytics will be able to write their own tickets and move freely between end-user companies and big data startups.

9. Creativity and Problem Solving


No matter how many advanced analytic tools and techniques you have on your belt, nothing can replace the ability to think your way through a situation. The implements of big data will inevitably evolve and new technologies will replace the ones listed here. But if you’re equipped with a natural desire to know and a bulldog-like determination to find solutions, then you’ll always have a job offer waiting somewhere.

Thursday, 15 October 2015

M2M - Future of Code


Machine-to-Machine Technology



How far can big data go? What is next for big data analytics? According to GCN, the next horizon for big data may be machine-to-machine (M2M) technology. As coding of big data advances, Oracle is now considering big data "an ecosystem of solutions" that will incorporate embedded devices to do real-time analysis of events and information coming in from the "Internet of Things," according to the Dr. Dobbs website. There is so much data being generated by all of the sensors and scanners we have today. All of this data is useless unless taken in context with other sparse data. Each strand of data may only be a few kilobytes in size, but when put together with other sensor readings, they can create a much fuller picture. Applications are needed not only to enable devices to talk with others using M2M, but also to collect all the data and make sense of it.

The future of sparse data could even include what some consider thin data. Thin data could include simple sensors and threshold monitors built into the furniture and ancillary office equipment. Viewing all the sensors on a floor over time might show the impact of changing the temperature in the space, or of moving the coffee machine. You could look at the actual usage data of fixtures like doors and lavatories. There is a huge potential for inferential data mining. To take thin data to the next level, consider reproducing nanotechnology embedded in plant seeds. The nano agent would become part of the plant and relay state information as the plant grows. This would allow massive crop harvesters to know if and when the plants are in distress. Other areas of interest for thin data include monitoring traffic on bridges and roadways, or a variety of weather monitors and tsunami prediction systems.

Machina Research, a trade group for mobile device makers, predicts that within the next eight years, the number of connected devices using M2M will top 50 billion worldwide. The connected-device population will include everything from power and gas meters that automatically report usage data, to wearable heart monitors that automatically tell a doctor when a patient needs to come in for a checkup, to traffic monitors and cars that will by 2014 automatically report their position and condition to authorities in the event of an accident. One of the most popular M2M setups has been to create a central hub that can be used by wireless and wired signals. The sensors in the field would record an event of significance, be it a temperature change, inventory leaving a specific area or even doors opening. The central hub would then send that information to a central location where an operator might turn down the AC, order more toner cartridges or tell security about suspicious activity. The future model for M2M would eliminate the central hub or human interaction. The devices would communicate with each other and work out the problems on their own. This smart technology would decrease the logistics downtime associated with replacing an ink cartridge on a printer. Once the toner reached a low threshold, the printer would send a requisition to the toner supplier and a replacement would immediately be shipped. Once the toner was received, it could be replaced. This turn-around time would be drastically better than having the printer fail because of low toner levels, then ordering it, having to wait on shipping, and then replacing the toner.

Humans won't be completely removed from the equation. They will still need to be in the chain to oversee the different processes, but they will be more of a second pair of eyes and less of a direct supervisor. Humans will let the machines do the work, and will only get involved when a machine reports a problem, like a communications failure. More application software development will be needed in the future to connect those 50 billion devices. Another place to learn more about M2M development is the Eclipse Foundation.

Wednesday, 30 September 2015

So you support Digital India? Here's what you can do as a Startup.

So you support the Digital India initiative? Brilliant. Many have taken to Facebook, changing their profiles to the tri-color and following Mark Z, but are unable to answer how they can contribute beyond that.
Here are the areas that you, as an entrepreneur, startup, or technologist, should think about as places where you can make a difference and truly support this effort.
There are these five (or more) spaces that need to come together collectively to make this work.
1. Devices
This is going to be capital intensive, but we need a ton of new devices. And devices don't just mean mobile phones and tablets (and phablets), but also devices like kiosks, sensors (IoT and otherwise) and several uni-functional devices (think of the devices that bus conductors use - isn't it about time that we move to a self-serve model? but that's another topic - or the devices that traffic cops and policemen use for lookups). It also means local-language and simplified user interfaces - so that even our grandparents could use it (quite literally so).
2. Access
This is where the deals that they are making with Google, Microsoft etc. will come into play. This is where the net neutrality discussion is fiercely going on. How do we bring the cost of access down and make connectivity available everywhere at an affordable cost? Telecom players will get in on this and I am not too worried about this piece. India is already the cheapest telecom network in the world and data costs are "reasonable". If we provide adequate value and opportunities to earn by virtue of being connected, access costs shouldn't matter.
3. Content
We'll need content and content-creation infrastructure, not just in text but also in voice - IVR, TTS, and voice recognition - built up in local languages. Video and audio sites, audiences, and infrastructure need to come up here.
4. Services
This includes servers, the stack that goes on top, and the set of government-related services that need to be built. Governments have always talked about a modular system, but the better way of building these systems is in an API/Webhooks model, each talking to the other, interconnected and easily upgradeable independently. Aadhar is a critical piece of this. In technology speak, Aadhar is the identity management service.
5. Security
Security will cut across all these layers. A digital India also means a digitally vulnerable India, and we do have the likes of China who will hack into networks. Security at the infrastructure level and at the individual access level is something to look at. Whoever designs this needs to keep in mind that India is democratic, so some philosophical / ideological design choices have to be made to ensure that security doesn't turn into censorship and that the systems of democracy continue to remain open.
That last one is a big one and perhaps the most overlooked. Security and access will be the two controversial spaces.

PS: There is perhaps one more, sixth piece. While I merged servers and services together above, there are some fundamental pieces missing. Thanks to startups like Reverie, we have rather elegant font engines for embedded devices.

But have you ever tried storing data in a regional language in a database? Storing names of people, towns, and places that are supposed to be in a regional language in English is half the reason why we have digital inefficiency - it is very hard to differentiate between two places or names that are similar sounding, or are the same thing. We can avoid some of those issues if we can save and retrieve entries in regional languages. Some such building blocks (which should ideally be open source) are still missing. (by the startup guy.)

Sunday, 6 September 2015

13 big data and analytics companies to watch

Piling on
Investors are piling oodles of funds into startups focused on helping businesses quickly and cheaply sort through enormous data collections of both the structured and unstructured variety. Most of the newcomers not surprisingly have a cloud element to their offerings, leading to every sort of X-as-a-service pitch you can imagine. Here’s a look at some of the hottest big data and analytics companies (Note: I am only including those that have announced funding rounds this year).
Arcadia Data
Founded: 2012
Headquarters: San Mateo
Funding/investors: $11.5M in Series A funding in June, led by Mayfield.
Focus: Visual analytics and business intelligence for business users who need access to big data in enterprise Hadoop clusters without involving data scientists and other such experts. While optimized for Hadoop, customers can also use Arcadia’s technology for building browser-based apps across other data sources, including MySQL. A free download of the front-end visualization tool, Arcadia Instant, is available for Macs and Windows. Three of the co-founders come from Aster Data.
Cazena
Founded: 2014
Headquarters: Waltham, Mass.
Funding/investors: $28M, with a Series B round of $20M recently led by Formation 8.
Focus: Fast and inexpensive processing of big data in an encrypted cloud via what it calls enterprise Big Data-as-a-Service offerings broken down into Data Lake, Data Mart and Sandbox editions. The company, which emerged from stealth mode in July of 2015 and gets its name from the Hindi word for Treasure, is led mainly by former movers and shakers at Netezza, a data warehouse company acquired by IBM in 2010 for $1.7 billion.
DataHero
Founded: 2011
Headquarters: San Francisco
Funding/investors: $6.1M in Series A funding in May, led by Foundry Group.
Focus: Encrypted cloud-based business intelligence and analytics for end users, regardless of technical expertise. Self-service offering works with many popular services, including Salesforce.com, Dropbox, HubSpot and Google Analytics. Now led by CEO Ed Miller, a 25-year software industry veteran and entrepreneur.
DataTorrent
Founded: 2012
Headquarters: Santa Clara
Funding/investors: $23.8M, including a $15M Series B round in April led by Singtel Innov8.
Focus: Real-time big data analytics supported by an open source-based stream and batch processing engine that the company says can deal with billions of events per second in Hadoop clusters. Its co-founders both previously led engineering efforts at Yahoo (one at Yahoo Finance).
Enigma
Founded: 2011
Headquarters: New York City
Funding/investors: $32.7M, including a $28.2M Series B round announced in June and led by New Enterprise Associates.
Focus: Data discovery and analytics, both for enterprises that want better insight into their information and for the public, which can tap into a huge collection of public records. Enigma offers enterprises the Abstract data discovery platform and Signals analytics engine, which can be used to craft customized applications. Oh, and it uses one of those cool dot.io top-level domains for its URL.
Experfy
Founded: 2014
Headquarters: Harvard Innovation Lab in Boston
Funding/investors: $1.5M in seed funding, led by XPRIZE Chairman and CEO Peter Diamandis.
Focus: Cloud-based consulting marketplace designed to match up big data and analytics experts with clients who need their services. Experfy provides advisory services, big data readiness assessments, road maps, predictive dashboards, algorithms, and a number of custom analytics solutions for mid-market and Fortune 500s, according to co-CEO and Founder Harpreet Singh.
Interana
Founded: 2013
Headquarters: Menlo Park, Calif.
Funding/investors: $28.2M, including $20M in a Series B round in January led by Index Ventures.
Focus: Interactive analytics to answer business questions about how customers behave and products are used. A proprietary database enables its offering to deal with billions of events. Company was formed by a couple of former Facebook engineers and one of their wives, who serves as CEO.
JethroData
Founded: 2012
Headquarters: New York City
Funding/investors: $12.6M, including an $8.1M Series B round led by Square Peg Capital in June.
Focus: Index-based SQL engine for big data on Hadoop that enables speedy transactions with business intelligence offerings such as those from Qlik, Tableau and MicroStrategy. Co-founders come from Israel and have a strong track record at companies located in the United States. (Pictured: JethroData co-founders Eli Singer, Boaz Raufman and Ronen Ovadya.)
Neokami
Founded: 2014
Headquarters: Munich, Germany
Funding/investors: $1.1M in seed funding from angel investors.
Focus: “Revolutionising the Machine-Learning-as-a-Service space” that you probably didn’t even realize existed. This company exploits its expertise in artificial intelligence for CRM, security analytics and data science tools, the latter of which can be used to make sense of unstructured data via self-learning algorithms.
SlamData
Founded: 2014
Headquarters: Boulder
Funding/investors: $3.6M, led by True Ventures
Focus: Crafting the commercial version of the SlamData open-source project in an effort to help customers visualize semi-structured NoSQL data in a secure and manageable way. A key strength for SlamData is that it works natively with NoSQL databases like MongoDB. CEO Jeff Carr says SlamData is following in the footsteps of Tableau and Splunk, which have targeted structured and unstructured data, respectively.
Snowflake Computing
Founded: 2012
Headquarters: San Mateo
Funding/investors: $71M, with the most recent round of $45M led by Altimeter Capital.
Focus: This company doesn’t hide from the relatively old-fashioned term data warehouse, but puts a new twist on the technology by recreating it for the cloud and boasting of “the flexibility of big data platforms.” Led by former Microsoft and Juniper honcho Bob Muglia. Snowflake positions its offering as an elastic data warehouse.
Talena
Founded: 2011
Headquarters: Milpitas, Calif.
Funding/investors: $12M, including from Canaan Partners, announced in August.
Focus: Emerged from stealth mode in August with management software for big data environments, ensuring applications based on Hadoop, NoSQL and other platforms are available throughout their life cycle. Involves advances in data storage, recovery and compliance. CEO and Founder Nitan Donde led engineering efforts at Aster Data, EMC and other firms.
Tamr
Founded: 2013
Headquarters: Cambridge, Mass.
Funding/investors: $42.4M, with Series B funding of $25.2M from Hewlett-Packard Ventures and others announced in June.
Focus: Co-founder and database legend Michael Stonebraker says Tamr's "scalable data-unification platform will be the next big thing in data and analytics - similar to how column-store databases were the next big thing in 2004." In other words, Tamr uses machine learning and human input to enable customers to make use of data currently siloed in disparate databases, spreadsheets, logs and partner resources. Tamr's tech got its start at MIT's CSAIL.