At this stage, I am assuming you have already started browsing and trying the code examples in the GitHub repository (https://github.com/PacktPublishing/Hands-On-Deep-Learning-with-Apache-Spark) associated with this book. If so, you should have noticed that all of the Scala examples use Apache Maven (https://maven.apache.org/) for packaging and dependency management. In this section, I am going to use this tool to build a DL4J job that will then be submitted to Spark to train a model.
Once you are confident that the job you have developed is ready for training on the destination Spark cluster, the first thing to do is to build the uber-JAR file (also called the fat JAR file), which contains the Scala DL4J Spark program classes and their dependencies. Check that all of the required DL4J dependencies for the given project are present in the <dependencies> block of the project POM file, and that the correct version of the dl4j-spark library has been selected; all of the examples in this book are meant to be used with Scala 2.11.x and Apache Spark 2.2.x. The code should look as follows:
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>dl4j-spark_2.11</artifactId>
    <version>0.9.1_spark_2</version>
</dependency>
If your project POM file contains, among the other dependencies, references to Scala and/or any of the Spark libraries, please declare their scope as provided, as they are already available across the cluster nodes. This way, the uber-JAR will be lighter.
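As a minimal sketch (the Spark version shown here is just an example in the 2.2.x line, not necessarily the one used by the book's projects), a provided-scope dependency would look like this:
<!-- Spark is already available on the cluster nodes, so it is not bundled in the uber-JAR -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.1</version>
    <scope>provided</scope>
</dependency>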
Once you have checked the dependencies, you need to instruct the POM file on how to build the uber-JAR. There are three techniques for building an uber-JAR: unshaded, shaded, and JAR of JARs. The best approach in this case is a shaded uber-JAR. Like the unshaded approach, it works with the default Java class loader (so there is no need to bundle an extra, special class loader), but it brings the advantage of avoiding some dependency version conflicts and, when files with the same path are present in multiple JARs, the possibility of applying an appending transformation to them. Shading can be achieved in Maven through the Shade plugin (http://maven.apache.org/plugins/maven-shade-plugin/). The plugin needs to be registered in the <plugins> section of the POM file as follows:
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.2.1</version>
    <configuration>
        <!-- put your configurations here -->
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
</plugin>
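As a hint of what could go into the <configuration> block, the following sketch (not taken from the book's projects; the filter pattern and the reference.conf resource are just common examples) excludes the digital signature files that would otherwise invalidate the shaded JAR and applies the appending transformation mentioned previously to files that appear in multiple JARs with the same path:
<configuration>
    <filters>
        <filter>
            <!-- strip signature files coming from the bundled dependencies -->
            <artifact>*:*</artifact>
            <excludes>
                <exclude>META-INF/*.SF</exclude>
                <exclude>META-INF/*.DSA</exclude>
                <exclude>META-INF/*.RSA</exclude>
            </excludes>
        </filter>
    </filters>
    <transformers>
        <!-- concatenate files with the same path instead of keeping only one copy -->
        <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
            <resource>reference.conf</resource>
        </transformer>
    </transformers>
</configuration>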
This plugin executes when the following command is issued:
mvn package -DskipTests
At the end of the packaging process, recent versions of this plugin replace the slim JAR with the uber-JAR and rename it with the original filename. For a project with the following coordinates, the name of the uber-JAR would be rnnspark-1.0.jar:
<groupId>org.googlielmo</groupId>
<artifactId>rnnspark</artifactId>
<version>1.0</version>
The slim JAR is preserved anyway, but it is renamed original-rnnspark-1.0.jar. Both can be found inside the target subdirectory of the project root directory.
The JAR can then be submitted to the Spark cluster for training using the spark-submit script, the same way as for any other Spark job, as follows:
$SPARK_HOME/bin/spark-submit --class <package>.<class_name> --master <spark_master_url> <uber_jar>.jar
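For example, assuming a hypothetical main class named org.googlielmo.rnnspark.SparkTraining in the rnnspark project above and a standalone cluster master listening at spark://spark-master:7077 (both names are placeholders to be replaced with your own), the command would be similar to the following:
# the class name and master URL below are placeholders for your own values
$SPARK_HOME/bin/spark-submit \
  --class org.googlielmo.rnnspark.SparkTraining \
  --master spark://spark-master:7077 \
  rnnspark-1.0.jar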