H20 – The Killer-App on Apache Spark
In-memory big data has come of age. The Apache
Spark platform, with its elegant API, provides a unified platform for building
data pipelines. H2O has focused on scalable machine learning as the API for big
data applications. Spark + H2O combines the capabilities of H2O with the Spark
platform – converging the aspirations of data science and developer
communities. H2O is the Killer-Application for Spark.
Backdrop
Over the past few years, we watched Matei and the team behind
Spark build a thriving open-source movement and a great development platform
optimized for in-memory big data, Spark. At the same time, H2O built a great
open source product with a growing customer base focused on scalable machine
learning and interactive data science. These past couple of months the Spark
and H2O teams started brainstorming on how to best combine H2O’s Machine
Learning capabilities with the power of the Spark platform. The result: Sparkling
Water.
Users can in a single invocation and process, get the best of
Spark – its elegant APIs, RDDs, multi-tenant Context and H2O’s speed,
columnar-compression and fully-featured Machine Learning and Deep-Learning
algorithms.
One of the primary draws for Spark is its unified nature,
enabling end-to-end building of API’s within a single system. This
collaboration is designed to seamlessly enable H20’s advanced capabilities to
be part of that data pipeline. The first step in this journey is enabling
in-memory sharing through Tachyon and RDDs. The roadmap includes deeper
integration where H2O’s columnar-compressed capabilities can be natively
leveraged through ‘H2ORDD’.
Today, data gets parsed and exchanged between Spark and H2O via
Tachyon. Users can interactively query big data both via SQL and ML from within
the same context.
Sparkling Water enables use of H2O’s Deep Learning and Advanced
Algorithms for Spark’s user community. H2O as the killer-application provides a
robust machine learning engine and API for the Spark Platform. This will further
empower application developers on Spark to build intelligent and smarter
applications.
MLlib and H2O: The Triumph of Open Source!
MLlib is a library of efficient implementations of popular algorithms directly built using Spark. We believe that enterprise customers should have the choice to select the best tool for meeting their needs in the context of Spark. Over time, H2O will accelerate the community’s efforts towards production ready scalable machine learning. Fast fully featured algorithms in H2O will add to growing open source efforts in R, MLlib, Mahout and others, disrupting closed and proprietary vendors in machine-learning and predictive analytics.
Natural integration of H2O with the rest of Spark’s capabilities
is a definitive win for enterprise customers.
Demo Code
package water.sparkling.demo
import water.fvec.Frame
import water.util.Log
import hex.gbm.GBMCall.gbm
object AirlinesDemo extends Demo {
override def run(conf:
DemoConf): Unit = {
// Prepare data
// Dataset
val dataset =
"data/allyears2k_headers.csv"
// Row parser
val rowParser = AirlinesParser
// Table name for SQL
val tableName =
"airlines_table"
// Select all flights
with destination == SFO
val query =
"""SELECT * FROM airlines_table WHERE dest="SFO"
"""
// Connect to shark
cluster and make a query over prostate, transfer data into H2O
val frame:Frame =
executeSpark[Airlines](dataset, rowParser, conf.extractor, tableName, query,
local=conf.local)
Log.info("Extracted frame from Spark: ")
Log.info(if
(frame!null) frame.toString + "\nRows: " + frame.numRows() else
"")
// Now make a blocking
call of GBM directly via Java API
val model = gbm(frame,
frame.vec("isDepDelayed"), 100, true)
Log.info("Model
built!")
}
override def name:
String = "airlines"
}
Datamigration is the process of moving all the data from one source to another. This is very important because it reduces cost and is more practical than transferring data manually. However, it can be difficult because of the wide variety of formats and systems that data can be in. Many programs have been made with the goal of making data migrationeasier.
ReplyDelete