The purpose of this tutorial is to set up the necessary environment for the development and deployment of Spark applications with Scala. Specifically, we are going to use the Eclipse IDE for developing applications and deploy them with spark-submit. The glue that ties everything together is the sbt interactive build tool. The sbt tool provides plugins used to:
- Create an Eclipse Scala project with Spark dependencies
- Create a jar assembly with all necessary dependencies so that it can be deployed and launched using spark-submit
The steps presented assume just a basic Linux installation with Java SE Development Kit 7. We are going to download, install, and configure the following software components:
- The latest sbt build tool
- Scala IDE for Eclipse
- Spark 1.4.1
Installation instructions
Installing sbt
sbt download and installation is straightforward, as shown in the commands below:

~$ wget https://dl.bintray.com/sbt/native-packages/sbt/0.13.8/sbt-0.13.8.tgz
~$ gunzip sbt-0.13.8.tgz
~$ tar -xvf sbt-0.13.8.tar
~$ export PATH=$PATH:~/sbt/bin
The last command adds the sbt binary directory (~/sbt/bin) to the PATH shell variable. Now we can call sbt from any directory to create and package our projects. The first time it runs it will need to fetch some data over the internet, so be patient!
We are not quite done with sbt yet. We need to install two very important plugins.

sbteclipse plugin
sbteclipse is the sbt plugin for creating Eclipse project definitions.
Add sbteclipse to your plugin definition file (or create one if it doesn't exist). You can use either:
- the global file (for version 0.13 and up) at ~/.sbt/0.13/plugins/plugins.sbt
- the project-specific file at PROJECT_DIR/project/plugins.sbt
For the latest version, add the following line to plugins.sbt:

addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")

This can be done as shown below (use ^D to end the cat command):

~$ mkdir -p ~/.sbt/0.13/plugins    # mkdir -p creates all necessary directories in the path in the given order
~$ cat >> ~/.sbt/0.13/plugins/plugins.sbt
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")
<ctrl>+D
~$
After installation, the next time we launch sbt we will be able to use the additional command eclipse.

sbt-assembly plugin
sbt-assembly is an sbt plugin that creates a fat JAR of your project with all of its dependencies included. According to the Spark documentation, if your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. This is why we need the sbt-assembly plugin. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled, since they are provided by the cluster manager at runtime. Once you have an assembled jar, you can call the bin/spark-submit script, as shown later below, while passing your jar.

~$ mkdir -p ~/.sbt/0.13/plugins
~$ cat >> ~/.sbt/0.13/plugins/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
<ctrl>+D
~$
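After both installation steps, the global plugins file ~/.sbt/0.13/plugins/plugins.sbt should contain just these two lines:

// ~/.sbt/0.13/plugins/plugins.sbt
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")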
Installing Scala IDE for Eclipse
Downloading and installing the Scala IDE for Eclipse is also straightforward:
~$ wget http://downloads.typesafe.com/scalaide-pack/4.1.1-vfinal-luna-211-20150728/scala-SDK-4.1.1-vfinal-2.11-linux.gtk.x86_64.tar.gz
~$ gunzip scala-SDK-4.1.1-vfinal-2.11-linux.gtk.x86_64.tar.gz
~$ tar -xvf scala-SDK-4.1.1-vfinal-2.11-linux.gtk.x86_64.tar
~$ ~/eclipse/eclipse    # this runs the Eclipse IDE
As you can see from the figure below, a new menu item named Scala has been added to the classic Eclipse menu bar:
Installing Spark 1.4.1 (this may take a while)
Instructions for downloading and building Spark are provided here. There are several options available; since Spark is packaged with a self-contained Maven installation (located under the build/ directory) to ease building and deployment of Spark from source, we choose this option. Notice that we build Spark with the latest Scala 2.11 (included in the Eclipse Scala IDE we have just downloaded in the previous step):

~$ wget http://www.apache.org/dyn/closer.cgi/spark/spark-1.4.1/spark-1.4.1.tgz
~$ gunzip spark-1.4.1.tgz
~$ tar -xvf spark-1.4.1.tar
~$ cd spark-1.4.1/
~/spark-1.4.1$ build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Dscala-2.11 -DskipTests clean package
...
...
~/spark-1.4.1$ export PATH=$PATH:~/spark-1.4.1/bin    # make all Spark binaries accessible
As with sbt above, we include the last command so as to make the Spark binaries accessible from anywhere.
Having installed all the necessary components, we now proceed to demonstrate the creation of a simple application.
Creating a sample application (sbt package)
The task now is to create a self-contained Scala/Spark application using sbt and the Eclipse IDE.

Creating a sample sbt project
For this demonstration, we will create a very simple Spark application in Scala named SampleApp (creating a realistic application will be covered in a follow-up post). First we prepare the directory structure:

~$ mkdir SampleApp
~$ cd SampleApp
~/SampleApp$ mkdir -p src/main/scala    # mandatory structure
In the directory ~/SampleApp/src/main/scala we create the following Scala file SampleApp.scala (using just a text editor for now):

/* SampleApp.scala:
   This application simply counts the number of lines that contain "val" from itself */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SampleApp {
  def main(args: Array[String]) {
    val txtFile = "/home/osboxes/SampleApp/src/main/scala/SampleApp.scala"
    val conf = new SparkConf().setAppName("Sample Application")
    val sc = new SparkContext(conf)
    val txtFileLines = sc.textFile(txtFile, 2).cache()
    val numAs = txtFileLines.filter(line => line.contains("val")).count()
    println("Lines with val: %s".format(numAs))
  }
}
In the directory ~/SampleApp we create a build configuration file sample.sbt containing the following:

name := "Sample Project"

version := "1.0"

scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1"
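Note that sample.sbt declares spark-core as a regular dependency because we build a thin jar with sbt package below. If you instead build a fat jar with the sbt-assembly plugin installed earlier, a common variant marks Spark as a provided dependency so it is not bundled into the assembly. A minimal sketch of such a build definition (illustrative only, not needed for this tutorial):

// Illustrative variant for use with sbt-assembly: Spark is marked "provided"
// so it is not packaged into the fat jar; the cluster supplies it at runtime.
name := "Sample Project"

version := "1.0"

scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1" % "provided"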
The resulting directory structure should be as shown below:
osboxes@osboxes:~/SampleApp$ find .
.
./sample.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SampleApp.scala
Use sbt to package and launch the sample application
We are now ready to package everything into a single jar and deploy it using spark-submit. Notice that the sbt tool creates a hidden directory in the home folder, ~/.ivy2/, which contains all the cached jars used for packaging the application.

~/SampleApp$ sbt package
...
[info] Loading global plugins from /home/osboxes/.sbt/0.13/plugins
[info] Set current project to Sample Project (in build file:/home/osboxes/SampleApp/)
...
...
[info] Compiling 1 Scala source to /home/osboxes/SampleApp/target/scala-2.11/classes...
[info] Packaging /home/osboxes/SampleApp/target/scala-2.11/sample-project_2.11-1.0.jar ...
[info] Done packaging.
[success] Total time: 15 s, completed 30-Jul-2015 18:55:17
~/SampleApp$
Notice that the result of the packaging is the file sample-project_2.11-1.0.jar. This is deployed as follows:

~/SampleApp$ spark-submit --class "SampleApp" --master local[2] target/scala-2.11/sample-project_2.11-1.0.jar
...
...
Lines with val: 6
~/SampleApp$
We can easily verify that the number of lines in our simple script containing "val" is indeed six (five val assignments plus one occurrence in the println command argument).

Use sbt to create an Eclipse project
In order to create an Eclipse project for this sample application, we issue the following sbt command:

~/SampleApp$ sbt eclipse    # this command was installed with the sbteclipse plugin
[info] Loading global plugins from /home/osboxes/.sbt/0.13/plugins
[info] Set current project to Sample Project (in build file:/home/osboxes/SampleApp/)
[info] About to create Eclipse project files for your project(s).
[info] Successfully created Eclipse project files for project(s):
[info] Sample Project
~/SampleApp$
Now the Eclipse project is created inside the ~/SampleApp directory. We use Eclipse to import an existing project:

Select Browse to search for the ~/SampleApp directory. Do not check the option Copy projects into workspace.
The result is the complete project tree in the Package Explorer of Eclipse. All Spark- and Hadoop-related dependencies have been automatically imported from sbt. Now you can edit SampleApp.scala directly from Eclipse, using code completion, syntax highlighting and more.

Run the sample application from Eclipse
Source code editing using Eclipse can be real fun! Code completion, refactoring, smart indenter, code formatting, syntax highlighting – you name it, Eclipse provides it! But what about running the application? We can do that too, with a little configuration and a minor addition in the Scala source code.
From the Eclipse menu bar select Run -> Run Configurations. On the left panel right click on Scala Application and select New. This opens the Create, manage, and run configurations window:
Enter the name of the class we want to deploy; in this case it is SampleApp. Then press Apply and the run configuration is ready to go. The last step is to modify the source code to reflect the Spark runtime configuration. In this example it suffices to set the master URL for launching to "local[2]". This will run the application locally, using two worker threads.

val conf = new SparkConf().setAppName("Sample Application").setMaster("local[2]")
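Hard-coding the master URL works for this demo, but it also overrides the --master option of spark-submit. If you want the same source to run both from Eclipse and via spark-submit, one option is SparkConf.setIfMissing, which applies the value only when no master has been supplied externally. A sketch of the modified application, assuming the same file path as before:

// Sketch: use local[2] only when no master was provided externally
// (e.g. when launching from Eclipse); spark-submit --master still takes precedence.
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object SampleApp {
  def main(args: Array[String]) {
    val txtFile = "/home/osboxes/SampleApp/src/main/scala/SampleApp.scala"
    val conf = new SparkConf()
      .setAppName("Sample Application")
      .setIfMissing("spark.master", "local[2]")
    val sc = new SparkContext(conf)
    val txtFileLines = sc.textFile(txtFile, 2).cache()
    val numAs = txtFileLines.filter(line => line.contains("val")).count()
    println("Lines with val: %s".format(numAs))
    sc.stop()    // stop the context so repeated runs from Eclipse start cleanly
  }
}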
Now we are ready to launch the application from Eclipse by selecting Run->Sample Application:
From this point onwards, we can use the Eclipse IDE to further develop our application and run some test instances during the process. When we are confident with our code, we can switch to sbt packaging/deployment and run our application on systems containing a Spark 1.4.1 installation. The development cycle can be as follows:
- Use Eclipse to modify the project and test it
- Use sbt package to create the final jar
- Deploy using spark-submit
- Go back to step 1, if necessary, and refine further